Abstract

A Wireless Sensor Network (WSN) consists of a large
number of low-cost, resource-constrained sensor nodes.
These nodes have low memory, low computation power, a
short communication range and a limited energy supply,
and they are deployed in hostile areas and left
unattended. These characteristics make the network
vulnerable to several attacks, such as the sinkhole
attack. A sinkhole attack is a type of attack in which a
compromised node tries to attract network traffic by
advertising fake routing updates. A sinkhole attack can
also be used to launch other attacks, such as selective
forwarding, acknowledgement spoofing, and dropping or
altering routing information. It can also be used to
send bogus information to the base station. This paper
explores and analyzes the existing solutions used to
detect and identify sinkhole attacks in wireless sensor
networks. The analysis is based on the advantages and
limitations of the proposed solutions.

Keywords: Wireless sensor network (WSN), sinkhole 
attack, detection of sinkhole attack 

I. INTRODUCTION 

A wireless sensor network consists of small nodes that
can sense data and send it to a base station [5].
Wireless sensor networks are used in many applications,
for example in military operations to track enemy
movement, in fire detection, and in health services for
monitoring heartbeat [2, 17, 13]. Unfortunately, most
wireless sensor networks are deployed in unfriendly
areas and are normally left unattended. In addition,
most of their routing protocols do not consider security
because of resource constraints, which include low
computational power, low memory, low power supply and
short communication range [8,9]. These constraints make
it easy for attackers to target wireless sensor
networks. One example is the sinkhole attack. A sinkhole
attack is carried out at the network layer, where an
adversary tries to attract as much traffic as possible
in order to prevent the base station from receiving
complete sensing data from the nodes [20]. The adversary
normally compromises a node and uses it to launch the
attack. The compromised node sends fake information to
neighboring nodes about its link quality, which is used
as the routing metric to select the best route during
data transmission. All packets from its neighbors then
pass through it before reaching the base station [22].
In this way a sinkhole attack prevents the base station
from acquiring complete and correct sensing data from
the nodes.

The purpose of this paper is to study existing solutions
for detecting sinkhole attacks. Different researchers
have suggested solutions to detect and identify sinkhole
attacks, such as Krontiris [14], Ngai et al [18] and
Sheela et al [25]. Krontiris et al [15] proposed a
rule-based detection solution. All the rules focused on
node impersonation and were embedded in an intrusion
detection system, so an intruder was detected as soon as
it violated one of the rules. Ngai et al [18] proposed a
centralized solution that involves the base station in
the detection process. Sheela et al [25] proposed a
non-cryptographic scheme that uses mobile agents in the
network to prevent sinkhole attacks.

The remainder of this paper is organized as follows.
Section 2 discusses the sinkhole attack and its
mechanism in two different protocols. Section 3 presents
the challenges in detecting sinkhole attacks in wireless
sensor networks. Section 4 presents the approaches
proposed by different researchers to detect sinkhole
attacks. Section 5 discusses these approaches, and
Section 6 concludes the paper and proposes future work.

II. SINKHOLE ATTACK 

A sinkhole attack is an insider attack in which an
intruder compromises a node inside the network and uses
it to launch the attack. The compromised node tries to
attract all the traffic from neighboring nodes by
exploiting the routing metric used in the routing
protocol. Once it has attracted that traffic, it
launches the attack. The many-to-one communication
pattern of wireless sensor networks, in which every node
sends data to the base station, makes a WSN vulnerable
to sinkhole attacks (Ngai et al [18]).

The following subsections discuss the techniques used to
launch a sinkhole attack under the MintRoute protocol
and the TinyAODV protocol.
Figure 1: Sinkhole attack in MintRoute protocol (Krontiris, I[15])




Figure 2: Sinkhole in TinyAODV protocol (Teng and Zhang, [27]) 



Sinkhole Attack in MintRoute Protocol 

MintRoute is a routing protocol commonly used in
wireless sensor networks. It was designed specifically
for wireless sensor networks: it is lightweight and
suits sensor nodes that have minimal storage capacity,
low computation power and a limited power supply.
MintRoute uses link quality as the metric for choosing
the best route to send packets to the base station
(Krontiris et al [15]).

Fig. 1 shows six sensor nodes A, B, C, D, E and F. Node
C is malicious and is going to launch a sinkhole attack.
Fig. 1(a) shows the routing table of node A, which lists
the IDs of its neighbors with their corresponding link
quality. Originally the parent node was node B, but node
C advertises a link quality of 255, which is the maximum
value. Node A will not change its parent node until node
B's link quality falls below 25 in absolute value.



In Fig. 1(b) the malicious node sends a new route update
packet stating that the link quality has fallen to 20,
and it impersonates node B so that node A believes the
packet comes from node B. Node A then updates its
routing table and changes its parent node to node C
(Krontiris et al [15]). The attacker thus relies on node
impersonation to launch the attack.
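
The two steps of this attack can be illustrated with a
minimal Python sketch. It is a simplified model and not
the MintRoute implementation: the function choose_parent,
the table layout and the starting link qualities of nodes
B and C are assumptions made for illustration. Only the
values 255, 25 and 20 come from the description above.

```python
# Simplified model of parent selection in node A (illustrative only).
SWITCH_THRESHOLD = 25  # the parent is kept until its link quality drops below this


def choose_parent(link_quality, current_parent):
    """Keep the current parent unless its link quality falls below the threshold."""
    if link_quality[current_parent] >= SWITCH_THRESHOLD:
        return current_parent
    # Otherwise switch to the neighbor advertising the best link quality.
    return max(link_quality, key=link_quality.get)


# Node A's routing table: neighbor ID -> link quality (starting values are hypothetical).
table_a = {"B": 120, "C": 90}
parent = "B"

# Step 1 (Fig. 1(a)): malicious node C advertises the maximum link quality.
table_a["C"] = 255
parent_after_advert = choose_parent(table_a, parent)

# Step 2 (Fig. 1(b)): C forges an update that appears to come from B
# and reports that B's link quality has fallen to 20.
table_a["B"] = 20
parent_after_spoof = choose_parent(table_a, parent_after_advert)

print(parent_after_advert, parent_after_spoof)  # B C
```

In this model the inflated advertisement alone does not
move node A away from B; the forged update in B's name is
what triggers the switch to C.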

Sinkhole Attack in TinyAODV Protocol 

The sinkhole attack can also be launched under the
TinyAODV (Tiny Ad hoc On-Demand Distance Vector)
protocol. TinyAODV works like AODV in MANETs, but it is
lighter than AODV and was modified specifically for
wireless sensor networks [27]. The routing metric in
this protocol is the number of hops to the base station.
A route from source to destination is created on demand:
when the source node wants to send a packet, it sends a
RREQ (Route Request) packet to its neighbors. A neighbor
that is close to the destination replies by sending back
a RREP (Route Reply) packet; otherwise it forwards the
RREQ to other nodes closer to the destination. Finally,
the source receives the RREP packets from its neighbors
and selects the node with the fewest hops to the
destination.

The sinkhole node, or compromised node, launches the
attack by sending back a RREP packet. In the RREP packet
it advertises a small number of hops, which suggests
that it is close to the base station. The source node
then decides to forward its packets to the sinkhole
node. The compromised node applies the same technique to
all its neighbors and tries to attract as much traffic
as possible [27].

For instance, Fig. 2 shows node M launching a sinkhole
attack in TinyAODV. Node A sends a RREQ to nodes B, C
and M. Nodes B and C forward the request to node D, but
node M, instead of forwarding it to node E, replies to
node A with a RREP. Node A then rejects nodes B and C
and forwards its packets to M, because nodes B and C
appear to be farther from F than node M.
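
The effect of a forged hop count on route selection can
be sketched as follows. This is a minimal model of the
selection step only, not TinyAODV itself: the function
select_next_hop, the dictionary layout of a RREP and the
hop counts are hypothetical.

```python
def select_next_hop(replies):
    """Pick the neighbor whose RREP advertises the fewest hops to the destination."""
    return min(replies, key=lambda reply: reply["hops"])["sender"]


# RREPs that eventually reach node A through honest neighbors (hop counts are hypothetical).
honest_replies = [
    {"sender": "B", "hops": 3},
    {"sender": "C", "hops": 3},
]

# Sinkhole node M answers the RREQ itself with a forged, very small hop count.
forged_reply = {"sender": "M", "hops": 1}

next_hop_without_attack = select_next_hop(honest_replies)
next_hop_with_attack = select_next_hop(honest_replies + [forged_reply])

print(next_hop_without_attack, next_hop_with_attack)  # B M
```

Because the source trusts the advertised hop count, a
single forged reply is enough to redirect its traffic.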

III. CHALLENGES IN DETECTION OF 
SINKHOLE ATTACK IN WSNs 

Based on the literature on sinkhole attacks in wireless
sensor networks, the main challenges in detecting
sinkhole attacks are the following.

A. Communication Pattern in WSN
All messages from sensor nodes in a wireless sensor
network are destined for the base station, which creates
an opportunity for a sinkhole to launch an attack.
Sinkhole attacks normally occur when a compromised node
sends fake routing information to other nodes in the
network with the aim of attracting as much traffic as
possible. Because of this communication pattern, the
intruder only needs to compromise nodes that are close
to the base station instead of targeting all nodes in
the network. This is a challenge because the
communication pattern itself provides the opportunity
for attack.

B. Sinkhole attack is unpredictable
In a wireless sensor network, packets are transmitted
according to the routing metric used by the routing
protocol [26]. The compromised node uses the routing
metric of the protocol to lie to its neighbors in order
to launch a sinkhole attack. All data from its neighbors
to the base station then passes through the compromised
node. The technique used by the compromised node
therefore depends on the protocol: MintRoute uses link
quality as its routing metric, while TinyAODV uses the
number of hops to the base station. The sinkhole attack
technique thus changes with the routing metric of the
routing protocol.



C. Insider Attack
Attacks in wireless sensor networks fall into two
categories: insider attacks and outsider attacks. In an
outsider attack the intruder is not part of the network.
In an insider attack the intruder compromises one of the
legitimate nodes, either through node tampering or
through a weakness in its system software, and the
compromised node then injects false information into the
network after listening to secret information. An
insider attack can disrupt the network by modifying
routing packets. Through the compromised node, a
sinkhole attack attracts nearly all the traffic from a
particular area once that node has been made attractive
to other nodes. Detection is difficult because the
compromised node possesses adequate access privileges in
the network and has knowledge of valuable information
about the network topology. For this reason even
cryptography cannot defend against insider attacks,
although it provides integrity, confidentiality and
authentication (Pathan, K [22]). An insider attack
therefore has a more serious impact on the victim system
than an outsider attack.

D. Resource Constraints

The limited power supply, short communication range, low
memory capacity and low computational power are the main
constraints in wireless sensor networks that hinder the
implementation of strong security mechanisms. For
example, the strong cryptographic methods used in other
networks cannot be implemented in this network because
of its low computational power and low memory capacity.
Weaker keys that are compatible with the available
resources are therefore used.

E. Physical attack

A wireless sensor network is normally deployed in a
hostile environment and left unattended. This gives an
intruder the opportunity to attack a node physically and
gain access to all necessary information [12].

IV. EXISTING APPROACHES 

Many researchers have been working in the wireless
sensor field to provide security mechanisms that suit
the resource constraints, driven by the growing demand
for applications in sensitive areas. This section
presents the approaches used by different researchers to
detect and identify sinkhole attacks in wireless sensor
networks. The approaches are classified into rule based,
key management, anomaly based, statistical and hybrid
based. The following subsections describe each category
and give examples of existing work that uses it.

A. Rule based 
The rules are designed based on the behavior or
technique used to launch a sinkhole attack. The rules
are then embedded in an intrusion detection system that
runs on each sensor node and are applied to the packets
transmitted through the network nodes. Any node that
violates the rules is considered an adversary and is
isolated from the network.

Existing work that uses the rule-based approach includes
Krontiris et al [14], who used it to detect sinkhole
attacks. They created two rules and embedded them in an
intrusion detection system (IDS). When a node violates
one of the rules, the IDS triggers an alarm, but it does
not provide the ID of the compromised node. The first
rule is "for each overheard route update packet, the ID
of the sender must be different from your node ID". The
second rule is "for each overheard route update packet,
the ID of the sender must be one of the node IDs in your
neighbors". Krontiris et al [15] used the same approach,
again with two rules. The first rule is "for each
overheard route update packet, the ID of the sender must
be one of the node IDs in your neighbors". The second
rule is "for each pair of parent and child nodes, the
difference between the link qualities they advertise for
the link between them cannot exceed 50".
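
The rules quoted above can be expressed as simple checks
on an overheard route update. The following Python sketch
is only an illustration of that logic, not the IDS of
[14] or [15]: the function names, the argument layout and
the example node IDs are hypothetical, and only the limit
of 50 comes from the text.

```python
MAX_LINK_QUALITY_GAP = 50  # limit stated in the second rule of [15]


def violated_rules(own_id, neighbor_ids, sender_id):
    """Return a description of each rule broken by an overheard route update."""
    violations = []
    if sender_id == own_id:
        # Someone else is sending updates under this node's identity.
        violations.append("sender uses my own ID")
    if sender_id not in neighbor_ids:
        # The claimed sender is not in this node's neighbor list.
        violations.append("sender is not a known neighbor")
    return violations


def link_quality_consistent(parent_advertised, child_advertised):
    """Check that parent and child advertise similar quality for their shared link."""
    return abs(parent_advertised - child_advertised) <= MAX_LINK_QUALITY_GAP


neighbors_of_a = {"B", "C"}
print(violated_rules("A", neighbors_of_a, "B"))  # []
print(violated_rules("A", neighbors_of_a, "Z"))  # ['sender is not a known neighbor']
print(link_quality_consistent(255, 110))         # False
```

As in the cited work, such checks raise an alarm but do
not by themselves reveal which node forged the packet.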

B. Anomaly-based detection

In anomaly-based detection, normal user behavior is
defined and the intrusion detection system searches for
anything anomalous in the network. An intrusion is
treated as an anomalous activity because it looks
abnormal compared to normal behavior. The rule-based and
statistical approaches are also included under the
anomaly-based detection approach.

Tumrongwittayapak and Varakulsiripunth [29] proposed a
system that uses the RSSI (Received Signal Strength
Indicator) value with the help of EM (Extra Monitor)
nodes to detect sinkhole attacks. The EM nodes have a
long communication range, and one of their functions is
to calculate the RSSI of a node and send it to the base
station together with the IDs of the source and next
hop. This happens as soon as the nodes are deployed. The
base station uses the RSSI values to calculate a VGM
(visual geographical map), which shows the position of
each node. Later, when the EM nodes send updated RSSI
values and the base station finds that the packet flow
has changed from the previous data, this indicates a
sinkhole attack. The base station identifies the
compromised node using the VGM and isolates it from the
network. However, if the attack is launched immediately
after network deployment, the system cannot detect it
[29]. In addition, the number of EM nodes required for a
given number of sensor nodes was not specified, and the
proposed method focuses only on static networks.

C. Statistical method 

In statistical approaches, data associated with certain
activities of the nodes in the network is studied and
recorded. Examples include monitoring the normal packet
traffic between nodes or monitoring resource depletion
of the nodes, such as CPU usage. The adversary or
compromised node is then detected by comparing the
actual behavior with a threshold value used as a
reference; any node that exceeds that value is
considered an intruder.

Chen et al [3] proposed a statistical GRSh
(Girshick-Rubin-Shiryaev)-based algorithm for detecting
malicious nodes in wireless sensor networks. The base
station monitors the CPU usage of each node over a fixed
time interval and calculates the difference in CPU usage
of each node. It then decides whether a node is
malicious by comparing that difference with a threshold.
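
The comparison step described above can be sketched as a
plain threshold test at the base station. This is not the
GRSh statistic of [3], only the final comparison; the
function name, the node IDs, the CPU readings and the
threshold value are all hypothetical.

```python
def flag_suspicious_nodes(previous_usage, current_usage, threshold):
    """Return the IDs of nodes whose CPU usage changed by more than the threshold."""
    flagged = []
    for node_id, current in current_usage.items():
        difference = abs(current - previous_usage[node_id])
        if difference > threshold:
            flagged.append(node_id)
    return sorted(flagged)


# CPU usage (percent) reported in two consecutive monitoring intervals.
previous = {"n1": 12.0, "n2": 15.0, "n3": 11.0}
current = {"n1": 13.0, "n2": 48.0, "n3": 12.5}

suspects = flag_suspicious_nodes(previous, current, threshold=20.0)
print(suspects)  # ['n2']
```

A larger threshold flags fewer nodes, which matches the
trade-off between detection time and false positives
reported for this approach.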

Roy et al [23] proposed a dynamic trust management
system to detect and eliminate multiple attacks,
including the sinkhole attack. Each node calculates the
trust of its neighbor nodes based on experience of
interaction, recommendation and knowledge, and then
sends it to the base station. The base station decides
which node is a sinkhole after it has received several
trust values from other nodes. A node whose trust value
falls below the normal value of 0.5 is considered a
sinkhole [23].
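
A minimal sketch of the base-station decision is given
below. It assumes that the reported trust values for a
node are combined by a simple average and that a value
below 0.5 marks the node as a sinkhole; the aggregation
rule, the function name and the sample reports are
assumptions for illustration and are not taken from [23].

```python
TRUST_THRESHOLD = 0.5  # normal trust value mentioned in the text


def find_untrusted(reports):
    """reports maps observer ID -> {neighbor ID: trust}; return nodes with low mean trust."""
    collected = {}
    for observations in reports.values():
        for node_id, trust in observations.items():
            collected.setdefault(node_id, []).append(trust)
    return sorted(
        node_id
        for node_id, values in collected.items()
        if sum(values) / len(values) < TRUST_THRESHOLD
    )


# Trust values sent to the base station by three observers (hypothetical data).
reports = {
    "n1": {"n2": 0.9, "n3": 0.2},
    "n2": {"n1": 0.8, "n3": 0.3},
    "n4": {"n3": 0.4, "n1": 0.7},
}

untrusted = find_untrusted(reports)
print(untrusted)  # ['n3']
```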



D. Hybrid based intrusion detection 

This approach combines anomaly-based detection with
signature-based (misuse-based) detection. Using both
methods reduces the false positive rate produced by
anomaly-based detection alone. A further advantage is
that the approach can catch suspicious nodes whose
signatures are not included in the detection database.

Coppolino and Spagnuolo [6] proposed a hybrid intrusion
detection system to detect sinkhole attacks and other
attacks. They used a detection agent that was
responsible for identifying sinkhole attacks. The hybrid
intrusion detection system was attached to a sensor node
and shared the resources of that node. After analyzing
the data collected from neighbors, it inserted
suspicious nodes into a blacklist based on anomalous
behavior. The list was then sent to a central agent,
which made the final decision based on the features of
the attack pattern (misuse based). Like the solution
proposed by Tumrongwittayapak and Varakulsiripunth [29],
it was designed for static wireless sensor networks.

E. Key management 

In the key management approach, the integrity and
authenticity of packets traveling within the network are
protected by using encryption and decryption keys. An
additional message is attached to every packet
transmitted in the network in such a way that accessing
the message requires a key and any small modification of
the message can be easily detected. The keys also help
nodes to check whether a message comes from the base
station and to verify the authenticity of the message.
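
The idea of attaching a keyed value that exposes any
modification can be illustrated with a message
authentication code. The sketch below uses HMAC-SHA256
from the Python standard library purely as a stand-in; it
is not the scheme of any work cited here, and a real
sensor node may need a lighter primitive. The key, the
payload format and the function names are hypothetical.

```python
import hashlib
import hmac


def tag_packet(key, payload):
    """Compute an authentication tag that is sent along with the payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify_packet(key, payload, tag):
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(tag_packet(key, payload), tag)


# Hypothetical key loaded onto the node before deployment.
shared_key = b"example-key-loaded-before-deployment"
payload = b"node=7;temp=21.5"
tag = tag_packet(shared_key, payload)

genuine_ok = verify_packet(shared_key, payload, tag)
tampered_ok = verify_packet(shared_key, b"node=7;temp=99.9", tag)
print(genuine_ok, tampered_ok)  # True False
```

A node without the key cannot produce a valid tag, but a
compromised node that holds the key still can, which is
why such schemes resist rather than detect insiders.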

Papadimitriou et al [21] proposed a cryptographic
approach in the routing protocol to address the problem
of sinkhole attacks. Each node obtains a public key that
is used to verify whether a message comes from the base
station. They also used a pair of public and private
keys to authenticate and sign data messages. All keys
were loaded offline before the network was deployed.
Their technique prevents any node from hiding its ID and
prevents packet forgery between nodes in the network.
The protocol focuses on resistance to sinkhole attacks
rather than on detecting and eliminating them.

Meanwhile, Fessant et al [10] proposed two protocols
that use cryptographic methods to increase resilience to
sinkhole attacks. Both protocols prevent malicious nodes
from lying about their advertised distance to the base
station. However, the authors did not report the memory
usage or the message size of their protocols.

Table 1 summarizes existing works that use the
approaches described above. The summary covers the
evaluation results of the proposed solutions and their
limitations.

Table 1: Existing works on Sinkhole detection



Approach | Proposed Solution | Result | Limitations/Advantages


Approach: Rule based. Krontiris et al 2007 [16]

Proposed Solution: They extended their IDS so that it
can detect sinkhole attacks.

Result:
• The success of the intrusion detection system depends
on increasing the number of watchdogs.
• When the network density increases, the false negative
rate decreases.

Limitations:
• Memory and network overhead was created.
• They used the MintRoute protocol.
• Node impersonation was the focus of the rules.

Advantages:
• More secure and robust measures can be developed based
on the valuable principles they developed.


Approach: Rule based. Krontiris et al 2008 [15]

Proposed Solution: They proposed detection rules that
make legitimate nodes aware of the existence of an
attack.

Result:
• They show how vulnerabilities of MultihopLQI can be
exploited by a sinkhole node and suggest rules that make
the protocol more resilient.

Limitations:
• They did not show in practice how those rules can
prevent an attack.
• The rules only detect the attack and cannot give the
ID of the sinkhole node.
• They assume the attacker has the same power as a
normal node and can capture a sensor node and change its
internal state.


Approach: Anomaly based. Tumrongwittayapak, C and
Varakulsiripunth, R 2009 [29]

Proposed Solution: They proposed a detection solution
based on the received signal strength indicator (RSSI).
Their proposed solution requires support from extra
monitor nodes.

Result:
• For 0 to 40% message drop, the detection rate is 100%.
• The false positive rate was 0 for 0-40% message drop
but increases as the percentage of dropped messages
increases.
• The same applies to the false negative rate: the more
messages are dropped, the higher the false negative
rate.

Limitations:
• They assume the sensor network is static.
• No instant attack.
• The base station remains at position (0,0).
• The base station and extra monitor nodes are assumed
to be physically protected.
• Their proposed solution cannot detect an attack that
happens instantly after network deployment.


Approach: Anomaly based. Choi et al 2009 [4]

Proposed Solution: They proposed a method that can
detect sinkhole attacks that use LQI (link quality
indicator).

Result:
• The probability of detection increases as the number
of detector nodes increases.
• The detection rate increases as the number of detector
nodes increases.
• The false positive rate depends on the tolerance value
(a constant that indicates whether a change is
abnormal).

Limitations:
• All sensor nodes have no mobility.
• A sinkhole is detected only when a detector node lies
between the sinkhole node and the source node and
between the sinkhole and the base station.
• The detector nodes have a larger energy source than
the sensor nodes.

Advantages:
• Detector nodes communicate among themselves through an
exclusive channel.


Approach: Anomaly based. Sharmila, S. and
Umamaheswari, G. 2011 [24]

Proposed Solution: They proposed a message digest
algorithm to detect the sinkhole node.

Result:
• The results show the algorithm worked well when
malicious nodes are below 50%.
• The false positive rate was 20% (due to packet drop);
that figure was obtained when malicious nodes reach 50.
• The false negative error was 10% but increased when
malicious nodes reach above 40.

Limitations:
• Network throughput, overhead and communication cost
were not calculated.
• The performance was not good when there are node
collisions, limited transmission power and packet drops.
• Only one advertisement is considered at a time; the
next is processed after the computation.

Advantages:
• The algorithm achieves data integrity and
authenticity.


Approach: Key management. Papadimitriou et al 2009 [21]

Proposed Solution: They proposed two RESIST protocols
which increase resilience to sinkhole attacks in WSNs.

Result:
• Results show that RESIST-0 has higher resilience to
sinkhole attacks than the other protocol (it does not
allow nodes to lie about their distance to the base
station).

Limitations:
• RESIST-0 is very expensive: it requires two additional
messages per packet.
• In their simulation, message losses and collusion were
not considered.
• Colluding nodes have an impact on RESIST-0 but not on
RESIST-1.
• Their routing algorithm relies on tree-based topology
construction.
• The route tree is built by hop distance.

Advantages:
• RESIST-1 prevents malicious nodes from changing their
advertised distance to the sink by more than one hop.
• RESIST-0 completely stops any lying about distance.


Approach: Statistical based. Chen et al 2010 [3]

Proposed Solution: They developed an algorithm which
detects sinkhole attacks and identifies the intruder.

Result:
• In the first simulation, the detection time increases
as the threshold (CPU value) becomes bigger.
• The false positive rate also decreases as the
threshold becomes bigger.
• In the second simulation, the detection time did not
change much, but the false positive rate increased due
to the increase in traffic.

Limitations:
• The base station makes the final decision on which
node is malicious.
• No results on the network overhead.
• The scheme will not detect an attack launched
instantly after deployment.
• Assumption: the base station is trustworthy and
participates in the detection system.

Advantages:
• Their algorithm can detect a malicious node in a short
time with a low false positive rate.


Approach: Hybrid based. Coppolino et al 2007 [6]

Proposed Solution: They proposed an intrusion detection
system that was able to protect critical information
from attacks directed from its WSN.

Result:
• The detection rate was 95-97% when the malicious node
modified sensor packets.
• The detection rate was 93-96% when the malicious node
modified the r
• The false positive rate is 3%.
• IDS usage in a real sensor network was 734 bytes (RAM)
and 3208 bytes (ROM).

Advantages:
• Their solution fits the available resources of sensor
nodes.
• Their solution was shown to detect sinkhole attacks.
• They used both anomaly-based and misuse-based methods.


Approach: Non-cryptographic. Sheela, D et al 2011 [25]

Proposed Solution: They proposed a scheme which uses
mobile agents to defend against this attack.

Result:
• The probability of detecting a sinkhole decreases as
the number of nodes increases.
• Average node energy decreases over time because of
stored information.
• The algorithm creates high network overhead.

Limitations:
• Mobile wireless sensor network.
• The exact number of MAs (mobile agents) in the network
is not specified.
• The matrix method is very complex relative to the
available resources.
• An MA communicates with sensor nodes in active mode
only.

Advantages:
• The MA uses dummy data to detect modification.
• The MA has sufficient power to run its activities.



V. DISCUSSION 

Table 1 shows that most approaches managed to detect and
prevent sinkhole attacks in WSNs.

The rule-based approaches managed to detect sinkhole
attacks but created memory and network overhead. They
did not give the ID of the sinkhole node after detecting
the attack, and all the rules focus on node
impersonation.



The anomaly-based approaches also managed to detect
sinkhole attacks, but they focus only on static wireless
sensor networks. They produced a high false positive
rate when message dropping was high.

Key management is another approach, which focuses on
resistance to sinkhole attacks rather than on detecting
and eliminating them.

The statistical approach managed to detect sinkhole
attacks, but no results on network overhead were given.
This approach also cannot detect an attack launched
immediately after the WSN is deployed. The false
positive rate was its main drawback.

The hybrid intrusion detection approach combines
anomaly-based and signature-based detection. It detected
sinkhole attacks with a lower false positive rate but
was designed for static WSNs.

The non-cryptographic approach also detected sinkhole
attacks, but it created high network overhead.

Overall, the approaches managed to detect, identify or
provide resistance to sinkhole attacks. Their major
drawbacks include high network and memory overhead, high
false positive rates, and the inability of some to work
on mobile WSNs.



VI. CONCLUSION AND FUTURE WORK 

Existing work shows that most researchers are looking
for ICT solutions for detecting, identifying and
providing resistance to sinkhole attacks in wireless
sensor networks. Some researchers used intrusion
detection schemes based on anomaly detection, while
others used rule-based and key management approaches to
detect and identify sinkhole nodes. The majority of
researchers struggled with security challenges related
to the availability of resources and the mobility of
wireless sensor nodes. Some provided solutions for
static networks only, and few for mobile networks. Very
few researchers managed to validate their security
systems using a real wireless sensor network. In
addition, some results showed a low detection rate, high
network overhead and high communication cost. Future
solutions should focus on reducing network overhead and
computational cost and on increasing the detection rate,
and they must be validated in a real sensor network.
Such validation makes it easy to check whether a
solution fits the available resources of a WSN, such as
memory capacity.

Abstract - Image decomposition is now essential for
transmission and storage in databases. Singular Value
Decomposition (SVD) is a decomposition technique for
calculating the singular values, pseudo-inverse and rank
of a matrix. The conventional way of finding the rank
was to convert the matrix to row echelon form; the rank
is then given by the number of nonzero rows of the
echelon form. Singular value decomposition is also one
of the methods used to compress and denoise images. The
main focus of this paper is to decompose and denoise an
image using singular value decomposition.

Key words: SVD, decomposition, denoise.



I. Introduction 

A digital image is generally encoded as a matrix of grey
level or color values. Each pair $(i, u(i))$, where
$u(i)$ is the value at position $i$, is called a pixel.
Image inaccuracies are categorized as noise and blur.
Noise is a perturbation of the image, and blur is
intrinsic to the image acquisition system. A good
quality image has a standard deviation of about 60.
Image compression is one of the applications of SVD.
Consider a matrix $A$ with rank 1000; that is, the
columns of this matrix span a 1000-dimensional space.
Encoding this matrix on a computer takes a lot of
memory, so we might be interested in approximating it
with a matrix of lower rank. An image is a section of
random access memory that has been copied to another
memory or storage location. Dimensionality reduction is
a noise reduction process, and SVD belongs to a class of
dimensionality reduction techniques that deal with
uncovering hidden data structures. The SVD of a matrix
$A$ has the form $A = USV^T$, where $U$ is a matrix
whose columns are the eigenvectors of $AA^T$; these are
termed the left singular vectors. $S$ is a matrix whose
diagonal elements are the singular values of $A$; it is
a diagonal matrix, so its off-diagonal elements are zero
by definition. $V$ is a matrix whose columns are the
eigenvectors of $A^T A$; these are termed the right
singular vectors. $V^T$ is the transpose of $V$ [1].

Keeping only the top $k$ singular values gives the
approximation $A^* = U^* S^* V^{*T}$. This process is
termed dimensionality reduction, and $A^*$ is referred
to as the rank-$k$ approximation of $A$ or the "reduced
SVD" of $A$. The top $k$ singular values are selected as
a means of developing the representation of $A$, which
is now free from noisy dimensions [2].
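
The rank-$k$ approximation can be written out term by
term. The symbols $\sigma_i$, $u_i$, $v_i$ and $r$ (the
singular values, the columns of $U$ and $V$, and the rank
of $A$) are introduced here for illustration. By the
Eckart-Young theorem, the spectral-norm error of the
approximation equals the first discarded singular value:

```latex
A = USV^{T} = \sum_{i=1}^{r} \sigma_i u_i v_i^{T},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0,

A^{*} = U^{*} S^{*} V^{*T} = \sum_{i=1}^{k} \sigma_i u_i v_i^{T},
\qquad k < r,

\lVert A - A^{*} \rVert_2 = \sigma_{k+1}.
```

For an $m \times n$ matrix, storing $A^*$ in factored
form takes $k(m + n + 1)$ numbers instead of $mn$, which
is where the compression comes from when $k$ is small.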

II. Data Base 




The image has 675 rows, 900 columns and 3 colour
channels. Signal processing aims at extracting
information from the raw signal. The difficulty of
reaching this goal depends on the characteristics of
both the noise-free signal and the noise. The
signal-to-noise ratio (SNR) is the ratio of the strength
of the signal to the strength of the noise. The higher
the ratio, the easier it is to extract information and
the more reliable the results. Here SNR is taken as the
ratio of the mean to the standard deviation of the
measured signal,

$\mathrm{SNR} = \bar{X} / s$

Calculate the signal as the mean of the pixel values,
and calculate the noise as the standard deviation (or
error value) of the pixel values. Take the ratio, or use
$\mathrm{SNR} = 10 \log_{10}(P_{signal} / P_{noise})$ to
express the result in decibels. 0.8620. The above image
has a standard deviation of 66.04.
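
The two SNR expressions can be checked on a small set of
values. The following Python sketch uses only the
standard library; the pixel values and power figures are
hypothetical, and the use of the population standard
deviation is an assumption, since the text does not say
which estimator is meant.

```python
import math
import statistics


def snr_ratio(pixels):
    """SNR as the mean of the pixel values divided by their standard deviation."""
    return statistics.mean(pixels) / statistics.pstdev(pixels)


def snr_decibel(p_signal, p_noise):
    """SNR in decibels from signal power and noise power."""
    return 10 * math.log10(p_signal / p_noise)


pixels = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical grey levels
ratio = snr_ratio(pixels)          # mean 5, standard deviation 2
decibels = snr_decibel(100.0, 10.0)

print(ratio, decibels)
```

For these values the mean is 5 and the population
standard deviation is 2, so the ratio is 2.5, and a
tenfold power ratio corresponds to 10 dB.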

III. Model description 

The singular value decomposition closely associated to the 
companion theory of diagonal in a symmetric matrix. If A 
is a symmetric real n x n matrix there is an orthogonal 
matrix V and a diagonal D such that, 
A=VDV T . 

Here the columns of V are eigenvectors of A and the 
diagonal entries of D are the eigenvalues of A. The singular 
value decomposition instead begins with an m x n real matrix A. 
There are orthogonal matrices U and V and a diagonal matrix S 
such that 

$A = U S V^T$ 

Here U is m x m and V is n x n, so that S is rectangular 
with the same dimensions as A. The diagonal entries of S can be 
arranged to be non-negative and in decreasing 
order. The columns of U and V are called the left and right 
singular vectors of A [5], [8]. 



International Journal of Computer Science and Information Security, 

Vol. 13, No. 5, 2015 

% remainder of the body of the loop over ranks
approx_S = S; approx_S(1:ns, 1:ns) = diag(approx_sigmas); 
approx_ncduck = U * approx_S * V';   % reconstruct with V transposed 
subplot(4, 2, j + 1), imshow(approx_ncduck), 
title(sprintf('Rank %d ncduck', ranks(j))); 
end 

The rank approximations of the image are shown 
below: 



IV. Analysis 

The analysis is carried out in MATLAB. The image 
is converted to greyscale and is then 
treated as a matrix. 

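a = imread('ncduck.jpg');        % read the colour image 

imshow(a) 

[m, n, k] = size(a)              % rows, columns and colour channels 

ncduck = rgb2gray(imread('ncduck.jpg'));       % convert to greyscale 
ncduck = im2double(imresize(ncduck, 0.5));     % halve the size, scale to [0, 1] 
[U, S, V] = svd(ncduck);                       % ncduck = U * S * V' 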

The above code computes the singular value decomposition of the 
image. The singular values are then extracted and plotted: 
sigmas = diag(S); 

figure; plot(log10(sigmas)); title('Singular Values (log10 Scale)'); 

The above code plots the singular values of the 
image on a base-10 logarithmic scale. 



[Figure: image panels titled "Full-Rank ncduck" and "Rank 50 ncduck"] 




The largest singular values are concentrated in approximately 
the first thirty ranks. 

figure; plot(cumsum(sigmas) / sum(sigmas)); 
title('Cumulative Percent of Total Sigmas'); 

The above code plots the cumulative share of the sum of the 
singular values. 




figure; subplot(4, 2, 1), imshow(ncduck), title('Full-Rank ncduck'); 
ranks = [50, 30, 20, 10, 3, 2]; 
for j = 1:length(ranks) 

% keep the first ranks(j) singular values and zero the rest 
approx_sigmas = sigmas; approx_sigmas(ranks(j)+1:end) = 0; 

ns = length(sigmas); 



[Figure: image panels titled "Rank 30 ncduck", "Rank 20 ncduck" and "Rank 10 ncduck"] 



The above images show that the more singular values are 
retained (the higher the rank), the better the quality of the 
image. With rank 30, the singular value decomposition represents 
the 675 x 900 pixel image by a 675 x 30 matrix U, a 30 x 30 
matrix of singular values and a 30 x 900 matrix $V^T$. Singular 
values can be used to highlight which dimensions are affected 
the most when a vector is multiplied by a matrix. 
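
The storage saving of a rank-k approximation can be checked with a short count. The sketch below is written in Python for illustration and assumes that only the first k columns of U, the k singular values and the first k rows of $V^T$ are stored; the function name is hypothetical.

```python
def rank_k_storage(m, n, k):
    """Values kept by a rank-k SVD approximation of an m x n matrix:
    k columns of U, k singular values and k rows of V^T."""
    return m * k + k + k * n

m, n, k = 675, 900, 30
full = m * n                      # entries of the original image matrix
reduced = rank_k_storage(m, n, k)
print(full, reduced, full / reduced)
```

Under these assumptions the rank-30 representation needs roughly one thirteenth of the values of the full matrix.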

V. De-noising 

The singular value decomposition gives 

$A = U S V^T, \quad A^T = (U S V^T)^T$ 

For a discrete noisy image $v = \{v(i) \mid i \in I\}$, the weight 
$\omega(p,q)$ depends on the similarity between the pixels p and q 
and satisfies the conditions of a probability: 

$0 < \omega(p,q) < 1$ and $\sum_{q} \omega(p,q) = 1$ [4]. 

Computing the similarity between image pixels depends on the 
similarity of the grey-level intensity vectors p (referred to as 
black) and q (referred to as white). Then, 

$\hat{p} = p^T U_k S_k^{-1}$ and $\hat{q} = q^T U_k S_k^{-1}$. 

The similarity of p and q is 

$\mathrm{sim}(p, q) = \mathrm{sim}(p^T U_k S_k^{-1},\ q^T U_k S_k^{-1})$ 

The weight is large when the similarity windows are alike and 
smaller when the grey-level intensities in the similarity 
windows vary. White noise alters the distance 
between windows in a uniform way. The MATLAB function 
impixelregion creates a Pixel Region tool associated with the 
image displayed in the current figure, called the target image. 
The Pixel Region tool opens a separate figure window 
containing an extreme close-up view of a small region of 
pixels in the target image [7]. 
VI. Compression model 

Use the Compress button to bring up the Wavelet Packet 
2-D Compression window. Select the remove near zero 
option from the Select thresholding method menu. The threshold 
for the image is 4.995 [3], [6]. 




Notice that the default threshold (1.5) 
provides about 43.61% compression while retaining 
virtually all the energy of the original image. Depending 
on the criteria, it may be worthwhile experimenting with more 
aggressive thresholds to achieve a higher degree of compression. 
This can be considered a precompression step in a broader 
compression system. Peak signal-to-noise ratio (PSNR) 
and mean square error (MSE) are used to compare the original 
image with the reconstructed image through the squared error 
between them. There is an inverse relationship 
between PSNR and MSE: a higher PSNR value indicates 
a higher-quality image. 
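
The inverse relationship between PSNR and MSE can be made concrete with the usual definitions. The following is a minimal Python sketch, assuming 8-bit images (peak value 255) given as flat lists of pixel values; it is not the toolbox routine used for the results reported here.

```python
import math

def mse(original, reconstructed):
    """Mean square error between two equally sized pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)

# Hypothetical pixel values, for illustration only.
original = [120, 130, 125, 118]
reconstructed = [121, 128, 125, 119]
print(mse(original, reconstructed), psnr(original, reconstructed))
```

Because the MSE appears in the denominator of the logarithm, a smaller error gives a larger PSNR.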




The PSNR of the compressed image is 28.02 dB. Since PSNR is a 
logarithmic ratio rather than a percentage, this value indicates 
the reconstruction quality and not the share of noise removed. 



VII. Conclusion 

The results show that singular value decomposition is one of 
the main decomposition approaches for adaptive de-noising. 
Experimental results are presented for the PSNR performance and 
the visual effect on colour images, even in the presence of a 
high ratio of noise. A high PSNR value indicates good image 
quality and less error introduced into the image. In the case of 
lossless compression the PSNR will be high. The work can be 
extended further to image denoising and image restoration in a 
3D framework. 
Abstract — The problem of optimizing a distributed database 
includes fragmentation and the positioning of data. Several 
approaches and algorithms have been proposed to solve this 
problem. In this paper, we propose an algorithm that builds the 
initial equivalence relation based on a distance threshold. This 
threshold is used within knowledge-oriented clustering 
techniques for both horizontal and vertical fragmentation. 
The similarity measures used in the algorithms are 
developed from classical measures. Experimental results 
on small data sets match the fragmentation results of 
the classical algorithms. Execution time and data fragmentation 
are significantly reduced, while the complexity of our algorithm 
in the general case is stable. 

Keywords — Vertical Fragmentation; Horizontal 
Fragmentation; Similarity Measure; Clustering Techniques; 
Knowledge-Oriented Clustering Techniques. 

I. Introduction 

In distributed computing environments, the unit of data 
(item) accessed at a station (site) is usually not a whole 
relation but part of a relation. Therefore, to optimize 
query performance, the relations of the global schema 
are fragmented into items. 

Several types of data fragmentation are in use: 
vertical fragmentation, horizontal fragmentation, mixed 
fragmentation and derived fragmentation. The two classical 
algorithms associated with horizontal and vertical 
fragmentation are PHORIZONTAL and BEA, respectively 
[11]. Many authors have proposed improvements to these 
two algorithms, such as Navathe et al. (1984) [13] and 
Chakravarthy et al. (1994) [3]. However, the complexity of 
these algorithms is quite high: $O(n^2)$ for the vertical 
fragmentation problem, where n is the number of attributes, and 
$O(2^m)$ for horizontal fragmentation, where m is the number of 
records [9], [11]. 

In recent years, several authors have addressed the problems of 
fragmentation and positioning together, using optimization 
algorithms [5-6], [10] or heuristic methods [4], [7]. The 
execution time of these algorithms is remarkably 
smaller than that of the classical algorithms. 

The use of association rules from data mining for 
vertical fragmentation has been mentioned in [8]. However, 
data mining techniques have not attracted many authors. 

In this paper, we use knowledge-oriented clustering 
techniques for the vertical and horizontal fragmentation problems. 
The similarity measures are developed from the 
measures available in classical data mining algorithms. 

Within the knowledge-oriented clustering algorithm, 
we propose an algorithm that builds the initial equivalence 
relation based on a distance threshold. This approach differs 
from the previous works of Hirano et al. [10], [14] 
and Bean et al. [2], [14-16] in that the proposed algorithm 
automatically determines the number of clusters from the 
data set under study. 

The paper is organized as follows: Section 2 presents a brief 
overview of the basic concepts. We detail the proposed 
vertical and horizontal fragmentation algorithms in Sections 3 
and 4, respectively. We then discuss the main 
contributions of the proposed approach in Section 5. 

II. BASIC CONCEPTS 

A. Vertical fragmentation 

Vertical fragmentation decomposes the set of attributes of 
the relational schema R into sub-schemas $R_1, R_2, \ldots, R_m$ such 
that the attributes in each sub-schema are often accessed 
together. 

To express how often attributes are accessed together by the 
same queries, Hoffer and Severance introduced the concept of 
attribute affinity [11]. 

Let $Q = \{q_1, q_2, \ldots, q_m\}$ be a set of applications and 
$R(A_1, A_2, \ldots, A_n)$ a relational schema. The relationship 
between $q_i$ and attribute $A_j$ is determined by the value: 

$use(q_i, A_j) = \begin{cases} 1, & A_j \text{ is referenced by } q_i \\ 0, & A_j \text{ is not referenced by } q_i \end{cases}$   (1) 

Put $Q(A_i, A_j) = \{q \in Q \mid use(q, A_i) \cdot use(q, A_j) = 1\}$. The 
attribute affinity between $A_i$ and $A_j$ is: 

$Aff(A_i, A_j) = \sum_{q \in Q(A_i, A_j)} \sum_{\forall S_l} ref_l(q) \cdot acc_l(q)$   (2) 

Here $ref_l(q)$ is the number of references to the pair of 
attributes $(A_i, A_j)$ by application q at station $S_l$, and 
$acc_l(q)$ is the access frequency of application q at station 
$S_l$. The BEA algorithm consists of two main phases: 

1) Permute the rows and columns of the attribute affinity matrix 
to obtain the cluster affinity matrix (CA), for which the global 
affinity measure AM [11] is the largest. 

2) Find the partition of the set of attributes from the 
matrix CA by exhaustive search, so that 

$Z = CTQ \cdot CBQ - COQ^2$ is maximal, with: 

$CTQ = \sum_{q_i \in TQ} \sum_{\forall S_j} ref_j(q_i)\, acc_j(q_i)$ 

$CBQ = \sum_{q_i \in BQ} \sum_{\forall S_j} ref_j(q_i)\, acc_j(q_i)$ 

$COQ = \sum_{q_i \in OQ} \sum_{\forall S_j} ref_j(q_i)\, acc_j(q_i)$ 



TABLE I. Cluster affinity matrix CA 

          A1   A2   ...  Ai   |  Ai+1  ...  An 
A1                            | 
A2               TA           | 
...                           | 
Ai                            | 
------------------------------+---------------- 
Ai+1                          | 
...                           |       BA 
An                            | 


In which, 

$AQ(q_i) = \{A_j \mid use(q_i, A_j) = 1\}$; 
$TQ = \{q_i \mid AQ(q_i) \subseteq TA\}$; 
$BQ = \{q_i \mid AQ(q_i) \subseteq BA\}$; 
$OQ = Q \setminus \{TQ \cup BQ\}$ 

The complexity of the algorithm is proportional to $n^2$. 
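
To make the attribute affinity of formula (2) concrete, the sketch below computes the affinity matrix for a small usage matrix. It is a simplified Python illustration that assumes a single station and one reference per execution, so that the inner sum reduces to the access frequency of each application; the values are illustrative and the function name is hypothetical.

```python
def attribute_affinity(use, freq):
    """Affinity matrix for a usage matrix `use` (applications x attributes)
    and per-application access frequencies `freq`, assuming one station."""
    n = len(use[0])
    aff = [[0] * n for _ in range(n)]
    for row, f in zip(use, freq):
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:      # application uses both attributes
                    aff[i][j] += f
    return aff

# Illustrative usage matrix (rows q1..q4, columns A1..A4) and frequencies.
use = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
freq = [45, 5, 75, 3]
for row in attribute_affinity(use, freq):
    print(row)
```

The BEA algorithm then permutes the rows and columns of this matrix; that clustering step is not shown here.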



B. Horizontal Fragmentation 

Horizontal fragmentation divides a set of records into smaller 
sets of records. Horizontal fragmentation is based on the query 
conditions, which are expressed through simple predicates of 
the form $A_j\ \theta\ \langle value \rangle$, where $\theta$ is a comparison operator. 

Let $Pr = \{Pr_1, Pr_2, \ldots, Pr_k\}$ be the set of simple predicates 
extracted from a set of applications. A conjunction of 
predicates built from Pr has the form: 

$p_1^* \wedge p_2^* \wedge \ldots \wedge p_n^*$   (3) 

where each $p_i^*$ is a predicate that takes one of the forms 
$p_i$ or $\neg p_i$. 

The PHORIZONTAL algorithm uses these conjunctions of 
predicates to find the conditions for the horizontal 
fragmentation of data [9]. The relation r(R) is fragmented 
into $\{r_1(R), r_2(R), \ldots, r_k(R)\}$, with $r_i(R) = \sigma_{F_i}(r(R))$, 
$1 \le i \le k$, where $F_i$ is a predicate formed as a conjunction of 
the simple predicates [9]. 
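
The idea of fragmenting by conjunctions of the form (3) can be sketched by grouping records according to which simple predicates they satisfy. The Python example below is only an illustration of that grouping, not the PHORIZONTAL algorithm itself; the relation, the predicates and the function name are hypothetical.

```python
def minterm_fragments(records, predicates):
    """Group records by the truth pattern of the simple predicates.
    Each pattern corresponds to one conjunction p1* ^ ... ^ pn*,
    where pi* is pi when the entry is True and its negation otherwise."""
    fragments = {}
    for record in records:
        pattern = tuple(bool(p(record)) for p in predicates)
        fragments.setdefault(pattern, []).append(record)
    return fragments

# Hypothetical relation and simple predicates, for illustration only.
employees = [
    {"eno": "E1", "salary": 27000},
    {"eno": "E2", "salary": 34000},
    {"eno": "E3", "salary": 40000},
]
simple_predicates = [
    lambda rec: rec["salary"] <= 30000,   # p1
    lambda rec: rec["salary"] > 35000,    # p2
]
fragments = minterm_fragments(employees, simple_predicates)
for pattern, recs in fragments.items():
    print(pattern, [rec["eno"] for rec in recs])
```

Only the conjunctions that are satisfied by at least one record produce a fragment, so empty fragments never appear in the result.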

C. Information systems and the inability to distinguish 
relationship 

• An information system is a pair SI = (U, A), where 
$U = \{t_1, t_2, \ldots, t_n\}$ is a finite set of objects and A is a 
non-empty finite set of attributes. 

• An equivalence relation (a binary relation that is 
reflexive, symmetric and transitive) defined 
on U is called an inability to distinguish relationship 
(indiscernibility relation) on U. 

D. Knowledge-oriented clustering algorithm 

The knowledge-oriented clustering algorithm, based on rough 
set theory, was first proposed by Shoji Hirano et al. 
[10], [15]. It is a clustering algorithm that automatically 
determines the number of clusters from the data under study 
[12]. The algorithm consists of two phases: 

1) Creation of an equivalence relation on the set of objects 
to be clustered. 

2) Modification of the equivalence relation using a threshold Tk 
based on the inability to distinguish measure. This iterative 
process updates Tk until the best clustering result is obtained. 

To apply this algorithm to data fragmentation, we propose an 
initial equivalence relation based on the average distance 
between objects. The knowledge-oriented clustering algorithm 
that we propose is as follows: 

Input: U, the set of objects to be clustered. 

(Each object must carry the information needed to 
construct a similarity measure.) 

Output: The clusters (corresponding to the fragments of 
data). 

Method: 

Step 1: Construct the matrix of similarities $S = (s(t_i, t_j))$ 
between all pairs of objects $(t_i, t_j)$; 

Step 2: Specify an initial inability to distinguish relationship 
$R_i$ for each object. Combine them to obtain an initial clustering; 

Step 3: Construct the inability to distinguish matrix 
$\Gamma = (\gamma(t_i, t_j))$ to assess the quality of the clustering; 

Step 4: Modify the clusters using the modified inability to 
distinguish relationship $R_i^{mod}$ for each object to obtain the 
revised clustering; 

Step 5: Repeat steps 3 and 4 until a stable clustering is 
obtained. 



The inability to distinguish relationship corresponding to the 
i-th object is: 

$R_i = \{(t_i, t_j) \in U \times U : d(t_i, t_j) < Th_i,\ j = 1, 2, \ldots, n\}$ 

where $d(t_i, t_j)$ is the distance between the two objects being 
clustered. 

The threshold $Th_i$ is determined as follows: 

$Th_i = \frac{\sum_{j=1,\, j \ne i}^{n} \left(1 - s(t_i, t_j)\right)}{n - 1}$ 

where $s(t_i, t_j)$ is the similarity measure of the two objects $t_i$, $t_j$. 
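
The construction of the initial relation from the average distance can be sketched as follows. This is a minimal Python illustration that assumes the distance is taken as one minus the similarity and that an object is related to every object closer than its own threshold; the similarity values and function names are hypothetical, and the later modification steps of the algorithm are not shown.

```python
def average_distance_thresholds(sim):
    """Threshold of each object: mean distance (1 - similarity) to all other objects."""
    n = len(sim)
    return [
        sum(1.0 - sim[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def initial_relation(sim):
    """For each object i, the set of objects whose distance to i is below the threshold of i."""
    thresholds = average_distance_thresholds(sim)
    n = len(sim)
    return [
        {j for j in range(n) if (1.0 - sim[i][j]) < thresholds[i]}
        for i in range(n)
    ]

# Hypothetical similarity matrix for four objects, for illustration only.
similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(average_distance_thresholds(similarity))
print(initial_relation(similarity))
```

Objects that share the same set of related objects fall into the same initial cluster, so the number of clusters follows from the data rather than being fixed in advance.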
III. Vertical fragmentation problem based on knowledge-oriented clustering techniques 

Vertical fragmentation problem is converted to the 
clustering problem, based on the following concepts: 

A. Attribute and the reference feature vector 

Definition 1: The reference measure of transaction $q_i$ on 
attribute $A_j$, denoted by $M(q_i, A_j)$, is: 

$M_{ij} = M(q_i, A_j) = use(q_i, A_j) \cdot f_i$ 

Here $M_{ij}$ is the frequency with which transaction $q_i$ 
references attribute $A_j$, $f_i$ is the frequency of transaction 
$q_i$ and $use(q_i, A_j)$ is defined by formula (1). 

Definition 2: The reference feature vector $VA_j$ of attribute $A_j$ 
with respect to the transactions $(q_1, q_2, \ldots, q_m)$ is defined 
as follows: 

           q1      q2      ...    qm 
VA_j =     M_1j    M_2j    ...    M_mj 



B. The similarity measure of two attributes 

Definition 3: The similarity measure of two attributes $A_k$, 
$A_l$ with reference feature vectors over the 
transactions $(q_1, q_2, \ldots, q_m)$ 

$VA_k = (M_{1k}, M_{2k}, \ldots, M_{mk})$ 

$VA_l = (M_{1l}, M_{2l}, \ldots, M_{ml})$ 

is determined by the cosine measure: 

$s(A_k, A_l) = \frac{VA_k \cdot VA_l}{\|VA_k\|\,\|VA_l\|} = \frac{\sum_{i=1}^{m} M_{ik} M_{il}}{\sqrt{\sum_{i=1}^{m} M_{ik}^2}\ \sqrt{\sum_{i=1}^{m} M_{il}^2}}$   (5) 

C. Vertical fragmentation based on knowledge-oriented 
clustering techniques 

To illustrate the vertical fragmentation algorithm based on 
knowledge-oriented clustering techniques, we use the 
data of the example of the vertical fragmentation problem 
solved with the BEA algorithm in [1], [8]: 

The set of attributes At = {A1, A2, A3, A4}. 

The set of transactions Q = {q1, q2, q3, q4}. The usage 
matrix: 

        A1   A2   A3   A4 
q1       1    0    1    0 
q2       0    1    1    0 
q3       0    1    0    1        (4) 
q4       0    0    1    1 


The execution frequencies of the transactions {q1, q2, q3, q4} 
are F = {f1, f2, f3, f4} = {45, 5, 75, 3}. 

From these assumptions, we have the reference feature 
vectors: 

          q1   q2   q3   q4 
VA1 =     45    0    0    0 
VA2 =      0    5   75    0 
VA3 =     45    5    0    3 
VA4 =      0    0   75    3 


The similarity matrix $S_{4 \times 4} = (s(A_k, A_l))$, k = 1..4, l = 1..4: 

        A1       A2       A3       A4 
A1      1 
A2      0        1 
A3      0.9918   0.0073   1 
A4      0        0.9970   0.0026   1 
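
The entries of this matrix can be recomputed from the usage matrix and the frequencies of the example. The Python sketch below applies the cosine measure of formula (5) to the reference feature vectors; the variable and function names are chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Usage matrix (rows q1..q4, columns A1..A4) and frequencies from the example.
use = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
freq = [45, 5, 75, 3]

# Reference feature vector of attribute j: M_ij = use(q_i, A_j) * f_i.
vectors = [[use[i][j] * freq[i] for i in range(4)] for j in range(4)]
sim = [[cosine(vectors[k], vectors[l]) for l in range(4)] for k in range(4)]

for row in sim:
    print([round(value, 4) for value in row])
```

The pairs (A1, A3) and (A2, A4) have similarities close to 1 while all other pairs are close to 0, which is what separates the two clusters.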



The result of the vertical fragmentation algorithm based on 
knowledge-oriented clustering: 

Cluster   Set of attributes 
1         {A1, A3} 
2         {A2, A4} 



These fragmentation results agree with the results of 
vertical fragmentation by the BEA algorithm. 
IV. Horizontal fragmentation problem based on knowledge-oriented clustering technique 

As with vertical fragmentation, the horizontal fragmentation 
problem solved by the PHORIZONTAL algorithm is converted 
to a clustering problem, based on the following 
concepts: 

A. Vectorization of the records of a relation 

Consider the relation $r(R) = \{T_1, T_2, \ldots, T_l\}$ and the set of 
simple predicates extracted from the applications on r(R), 
$Pr = \{Pr_1, Pr_2, \ldots, Pr_m\}$. The records are converted to 
binary vectors by the following rule: 



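TABLE II. Binary vectorization 

        Pr1    Pr2    ...   Prj    ...   Prm 
T1      a11    a12    ...   a1j    ...   a1m 
...     ...    ...    ...   ...    ...   ... 
Ti      ai1    ai2    ...   aij    ...   aim 
...     ...    ...    ...   ...    ...   ... 
Tl      al1    al2    ...   alj    ...   alm 

where $a_{ij} = \begin{cases} 1, & \text{if } T_i[Pr_j] = \text{true} \\ 0, & \text{if } T_i[Pr_j] = \text{false} \end{cases}$ 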



B. The similarity measure of two binary vectors 

Consider two vectors $x_i$ and $x_j$ that are represented by 
binary variables, and assume that the binary variables all have 
the same weight. The contingency table is given in Table III, 
where q is the number of binary variables equal to 1 for both 
vectors $x_i$ and $x_j$, s is the number of binary variables equal 
to 0 for $x_i$ but equal to 1 for $x_j$, r is the number of binary 
variables equal to 1 for $x_i$ but equal to 0 for $x_j$, and t is the 
number of binary variables equal to 0 for both vectors. 



TABLE III. Event table for binary variables 

                     Object j 
                     1       0       Sum 
Object i     1       q       r       q+r 
             0       s       t       s+t 
             Sum     q+s     r+t     p 



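• The difference between two vectors $x_i$ and $x_j$, based on the 
symmetric binary dissimilarity, is: 

$d(x_i, x_j) = \frac{r + s}{q + r + s + t}$   (6) 

• The similarity measure between two vectors $x_i$ and $x_j$ is the 
complement of this dissimilarity, $sim(x_i, x_j) = 1 - d(x_i, x_j)$, 
which is numbered (7) and referred to here as the Jaccard 
coefficient. 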


C. Horizontal fragmentation based on knowledge-oriented 
clustering techniques 

Consider the relation EMP [11]: 

TABLE IV. Sample data (EMP) for horizontal fragmentation 

        ENo   EName     Title 
T1      E1    J.Joe     Elect-Eng 
T2      E2    M.Smith   Syst-Analyst 
T3      E3    A.Lee     Mech-Eng 
T4      E4    J.Smith   Programmer 
T5      E5    B.Casey   Syst-Analyst 
T6      E6    L.Chu     Elect-Eng 
T7      E7    R.David   Mech-Eng 
T8      E8    J.Jone    Syst-Analyst 



Consider two simple predicates: 

p1 = (Title < "Programmer"); p2 = (Title > "Programmer"), 
with strings compared in alphabetical order; this orientation 
of the two comparisons is the one consistent with Table V. 
The vectorization of the records by the two predicates p1 and p2 is: 



TABLE V. Vectorization of the records 

        p1   p2 
T1      1    0 
T2      0    1 
T3      1    0 
T4      0    0 
T5      0    1 
T6      1    0 
T7      1    0 
T8      0    1 
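
The vectorization of the EMP records and the dissimilarity of formula (6) can be sketched in Python as follows. The sketch assumes that p1 selects the titles that precede "Programmer" alphabetically and p2 those that follow it, which is the assignment that reproduces Table V; the variable and function names are illustrative.

```python
# Titles of the EMP records T1..T8.
titles = [
    "Elect-Eng", "Syst-Analyst", "Mech-Eng", "Programmer",
    "Syst-Analyst", "Elect-Eng", "Mech-Eng", "Syst-Analyst",
]

# Assumed orientation of the two simple predicates (see the note above).
predicates = [
    lambda title: title < "Programmer",   # p1
    lambda title: title > "Programmer",   # p2
]

def vectorize(value, preds):
    """Binary vector with a 1 for every predicate the value satisfies."""
    return tuple(1 if p(value) else 0 for p in preds)

def dissimilarity(x, y):
    """Symmetric binary dissimilarity: share of positions where x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)

vectors = [vectorize(title, predicates) for title in titles]

# Records with identical vectors have dissimilarity 0 and always stay together.
groups = {}
for index, vector in enumerate(vectors, start=1):
    groups.setdefault(vector, []).append(f"T{index}")

print(vectors)
print(dissimilarity(vectors[0], vectors[1]), dissimilarity(vectors[0], vectors[3]))
print(sorted(groups.values()))
```

Grouping identical vectors is only a shortcut for this small example; it is not the clustering algorithm itself, which works from the similarity matrix.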



D. The result of horizontal fragmentation of the relation EMP 

The horizontal fragmentation of the relation EMP in 
Table IV is obtained with the knowledge-oriented clustering 
algorithm. We have used the similarity measure defined by 
formula (7), where $d(x_i, x_j)$ is calculated by formula (6). 

With k=2, we have: 

Cluster   The set of records 
1         T1, T3, T6, T7 
2         T2, T4, T5, T8 



With k=3, we have: 

Cluster   The set of records 
1         T1, T3, T6, T7 
2         T2, T5, T8 
3         T4 



$sim(x_i, x_j) = 1 - d(x_i, x_j)$   (7) 
And with k=4, we have: 

Cluster   The set of records 
1         T1, T3 
2         T2, T5, T8 
3         T4 
4         T6, T7 



These fragmentation results coincide with the results of 
horizontal fragmentation by the PHORIZONTAL algorithm. 

V. Conclusion 

In this paper, we used knowledge-oriented clustering 
techniques for the fragmentation problem in distributed database 
systems. With this solution, we also proposed transforming 
the assumptions of this problem into the assumptions of a 
clustering problem. Experimental results on the data in [11] 
agree with the results obtained from the classical 
fragmentation algorithms PHORIZONTAL and BEA. In 
addition to the experimental data presented, we also tested a 
number of other data sets, with results likewise similar to those 
of the two classical algorithms. In future work, we will 
analyse large data sets to test 
the usability of the proposed solution. 
Abstract — Understanding, or comprehending, source code is one of 
the core activities of software engineering. Understanding 
object-oriented source code is essential when a 
programmer maintains, migrates, reuses, documents or enhances 
source code. Source code that is not comprehended cannot be 
changed. The comprehension of object-oriented source code is a 
difficult problem-solving process. Documenting an 
object-oriented software system requires understanding its source 
code. To do so, it is necessary to mine source code dependencies 
in addition to quantitative information in the source code, such 
as the number of classes. This paper proposes an automatic approach 
that aims to document object-oriented software by visualizing 
its source code. The design of the object-oriented source code and 
its main characteristics are represented in the visualization. 
Package content, class information, relationships between classes, 
dependencies between methods and software metrics are displayed. 
The extracted views are very helpful for understanding and 
documenting the object-oriented software. The novelty of this 
approach is that it exploits code dependencies and quantitative 
information in source code to document object-oriented software 
efficiently by means of a set of graphs. To validate the approach, 
it has been applied to several case studies. The results of this 
evaluation showed that most of the object-oriented software 
systems were documented correctly. 

Keywords- Vsound; software engineering; software 
documentation; software visualization; software understanding; 
software maintenance; software evolution; software reuse; change 
impact analysis; object-oriented source code; reverse engineering. 

I. Introduction 

Different studies of software understanding indicate 
that programmers rely on good software documentation [22]. 
Software rapidly becomes very complex as its size 
increases, which makes its development a very hard task. The 
huge amount of information represented in the software 
source code, at all granularity levels (i.e., package, class, 
attribute and method), makes understanding and documenting 
software a very difficult, lengthy and error-prone task [3]. 
Moreover, manually written documentation tends to be 
incomplete, either because it is very time-consuming to 
create or because it must frequently be updated [12]. This 
paper proposes a new approach called Vsound 1 to understand 
and document object-oriented (OO) software by visualizing 
its source code as a set of graphs at all granularity levels. To 
give a precise definition of software documentation, Vsound 
considers software documentation to be the process of taking 
software source code and understanding it by visualizing it as 
a set of graphs. 

1 Vsound stands for Visualizing object-oriented Software for Understanding 
and Documentation. 

Software visualization is the use of the crafts of typography, 
graphic design, animation, and cinematography with modern 
human-computer interaction and computer graphics technology 
to facilitate both the human understanding and effective use of 
computer software [13]. Software visualization (resp. software 
documentation) can tackle three different aspects of 
software (i.e., static, dynamic and evolution) [2]. The 
visualization of the static aspects of software focuses on 
visualizing software as it is coded. The visualization of 
the dynamic aspects of software represents information about a 
specific run of the software and helps comprehend program 
behavior. Finally, the visualization of the evolution of the 
static aspects of software adds the time factor to the 
visualization of the static aspects. This paper tackles 
only the visualization (resp. documentation) of the static 
aspects of software. 

Software comprehension 2 is the process whereby a software 
practitioner understands a software artifact using both 
knowledge of the domain and/or semantic and syntactic 
knowledge, to build a mental model of its relation to the 
situation [14]. Software understanding is one of the main 
software engineering activities. Software understanding is the 
process of taking software source code and understanding it. 
Software comprehension is necessary when a programmer 
migrates, reuses, maintains, documents or enhances software 
systems. Software that is not comprehended cannot be changed 
[1]. The domains of software documentation and visualization 
are driven by the need for program comprehension. Software 
visualization (resp. documentation) is a successful means of 
software comprehension. Software comprehension is an important 
part of software evolution and software maintenance. 

The software maintenance process is the most expensive 
part of software development. Most of the time spent in software 
maintenance is used to comprehend the software code and the 
instructions that have to be changed [17]. Software 
maintenance is the modification of a software product after 
delivery to correct faults, to improve performance or other 
attributes, or to adapt the product to a changed environment 
[16]. The software undergoes modification of its source code and 
related documentation due to a problem or the need for 
enhancement. The goal is to modify the existing software while 
preserving its integrity [15]. 

2 a.k.a. "program understanding" or "source code comprehension". 

Software must frequently evolve to adjust to new features 
(resp. environments) or to meet specific requirements. 
A software system needs to evolve in order to remain in use for 
a longer time. Companies often release new versions of the 
original software, and each release increases the size and 
complexity of the software system. Thus, a 
software implementation that facilitates modification is key to 
reducing maintenance costs and effort. Software evolution 
reflects the process of progressive change in the attributes of 
the evolving entity or that of one or more of its constituent 
elements [20]. In other words, software evolution concerns 
how software systems evolve over time. 

Software reuse is important for reducing the cost and time of 
software development. To reuse existing source code, 
the software code must be understood and documented. 
As software complexity increases (i.e., as the number of lines 
of code grows), the need for reuse grows. Software reuse is the 
process of creating software systems from existing software 
rather than building them from scratch [18]. 
Software reuse helps reduce development and maintenance 
effort. In addition, software reuse improves software quality 
and decreases time-to-market [38]. 

Software comprehension is necessary for software 
changes in the maintenance stage. A change to one part of the 
software's code may affect another part of the code. 
In such situations the developer spends more time and effort 
finding the affected lines in the whole code. Change impact 
analysis is the process of identifying the potential consequences 
of a change, or estimating what needs to be modified to 
accomplish a change [21]. Change impact analysis supports 
program understanding by finding potential effects or 
dependency information in source code. 

This paper proposes a new approach, which aims to 
document software systems by visualizing their code. The 
documentation process is very useful for software 
understanding, maintenance, evolution, reuse and changes. 
Documentation process involves the creation of alternative 
representations of the software, usually at a higher level of 
abstraction. It also involves analyzing the software in order to 
determine its elements and the relations between those 
elements. Software visualization is commonly used in the fields 
of reverse engineering and maintenance, where huge amounts of 
code need to be understood [23]. Reverse engineering is the 
process of analyzing a subject system to identify the system's 
components and their interrelationships and create 
representations of the system in another form or at a higher 
level of abstraction [16]. 

To assist a human expert in documenting a software system, this 
paper proposes an automatic approach that generates a set of 
graphs (i.e., documents) from the source code elements of a 
software system. Compared with existing work that documents 
source code (cf. the related work section), the novelty of the 
Vsound approach is that it exploits source code dependencies and 
quantitative information to document OO software 
efficiently by means of a set of graphs. As a first step, Vsound 
accepts the source code of the software as input. Then, based on 
static code analysis, Vsound generates an XML file that 
contains the main source code elements (e.g., package, class, 
attribute and method) and the dependencies (e.g., inheritance) 
between those elements. Next, Vsound documents the software 
by extracting a set of graphs from the source code; each 
graph is considered a document. The mined documents cover all 
granularity levels (i.e., package, class, attribute and method) of 
the source code. 

The Vsound approach is detailed in the remainder of this 
paper as follows. Section 2 briefly presents the background 
needed to understand the proposal. Section 3 shows an 
overview of Vsound approach. Section 4 presents the software 
documentation process step by step. Section 5 describes the 
experiments that were conducted to validate Vsound proposal. 
Section 6 discusses the related work, while section 7 concludes 
and provides perspectives for this work. 

II. BACKGROUND 

This section gives a brief introduction to software documentation 
and visualization. It also briefly describes the dependencies 
between source code elements that are relevant to the 
Vsound approach. 

The software documentation process aims to generate 
documents containing abstract information based on software 
source code. The extracted documents are very useful, especially 
when software documents are missing. A good software 
documentation process helps the programmer working on the 
software to understand its features and functions. The 
process is specific in that it provides all the 
information important to the person who works on the software, 
at different levels of abstraction. The documentation process 
aims to translate the source code of the software system into a 
set of documents (i.e., graphs). In reality, many software 
systems, especially legacy software, have little or no 
documentation. Many companies face problems with 
legacy systems, such as software understanding and software 
maintenance, because software documentation is absent. 
Software systems that are not 
documented are hard to change [41]. Common examples of 
such documentation include requirement and specification 
documents [22]. Vsound provides as output a set of documents 
that describe the source code and its dependencies. 

Software visualization is the graphical display of information 
about the software source code. Software visualization is not a 
simple process, since the amount of information to be included 
in the graph may be far greater than can be displayed. A 
software visualization tool presents information about the 
software source code at different levels of abstraction and 
focuses on displaying different aspects of the 
source code. It should provide a way to choose and display only 
particular information from the software source code. 
Usually, the visualization process must determine the level of 
abstraction of the information it presents about the software 
source code. The software visualization tool must convert the 
software source code into a graph. It must also be able to 
visualize a huge amount of information about the software source 
code. Moreover, the visualization tool must provide an easy 
means of navigation [24]. 



Vsound relies on the source code of software systems to
extract a set of documents describing the software. The source
code is the most important source of information when the
software documentation is missing. To document existing OO
software based on its source code, the main source code
elements and the dependencies between those elements must be
extracted. Vsound considers that the software implementation
consists of OO source code elements and code dependencies. The
main elements of OO source code are: package, class, attribute
and method. Vsound also looks inside each method. A method
consists of a signature, which gives the access level, returned
data type and parameter list (i.e., name, data type and order),
and a body (i.e., local variables, method invocations and
attribute accesses). In the documentation process, Vsound
focuses on the main dependencies between source code elements:
inheritance, attribute access and method invocation. An
inheritance relation occurs when a general class (i.e.,
superclass) is connected to its specialized classes (i.e.,
subclasses). A method invocation relation occurs when methods
of one class use methods of another class, while an attribute
access relation occurs when methods of one class use attributes
of another class.

III. Approach Overview 

This section presents the main concepts and hypotheses
used in the Vsound approach. It also gives an overview of the
software documentation process. Then, it presents the OO
source code model. Finally, it briefly describes the example
that illustrates the remainder of the paper.

A. Key Ideas 

The general objective of the Vsound approach is to document
the source code of a single software system based on static
analysis of that source code. Mining the main entities of the
source code, together with the source code dependencies, is a
necessary first step towards this objective. Vsound considers
software systems in which software functionalities are
implemented at the programming language level (i.e., source
code). Vsound is also restricted to OO software systems. Thus,
software functionalities are implemented using OO source code
elements such as packages, classes, attributes, methods or
method body elements (i.e., local variable, attribute access,
method invocation). There are several ways to document
software source code, such as generating descriptive comments
that summarize all software classes or methods [12]. Vsound
aims to document the software source code as a set of
documents (i.e., graphs). Documenting a software system in a
single graph is difficult. Thus, the source code needs to be
documented through several detailed documents.

The documentation process must cover all granularity
levels of the software source code (i.e., package, class,
attribute and method). Vsound focuses on extracting three
types of documents (i.e., graphs). The first document, called
the package document, contains general information about the
software source code. It represents all software packages
together with the number of classes per package. It also
provides quantitative information about the software system.
The second document, called the class document, contains
information about software classes. It represents all
information about a class, such as the number of attributes and
methods, in addition to class dependencies. The third
document, called the method document, contains information
about software methods. It represents all information about a
method, such as the number of parameters and local variables,
in addition to method dependencies like attribute access and
method invocation.



B. The OO Software Documentation Process

The goal of Vsound is to document OO software by using its
source code. The software documentation process takes the
software's source code as input and generates a set of
documents as output.



[Figure 1 (diagram). Input, in the implementation space: the
software source code (e.g., a MyShape class in the package
Drawing.Shapes). Vsound approach: static analysis, followed by
identifying software documentation. Output, in the
documentation space: a set of documents (e.g., at package
level). The legend distinguishes processes, outputs and
documents.]

Figure 1. The OO software documentation process.

The Vsound approach exploits the main source code elements
and the dependencies between those elements in order to
document and understand an existing software system. Figure 1
shows the software documentation process. The first step of
this process extracts the main OO source code elements (i.e.,
package, class, attribute, method) and their relationships
(i.e., inheritance, method invocation and attribute access)
through static analysis of the source code. In the second
step, Vsound relies on the mined source code to document the
software at package level. In the third step, Vsound documents
the software at class level based on the extracted source
code. The last step documents the software at method level.
Finally, these documents (i.e., graphs) are used to understand
and document the software system.

C. Object-oriented Source Code Model 

The Vsound source code meta-model was inspired by the
FAMIX [39] information exchange meta-model. The source
code meta-model (cf. Figure 2) displays the main source code
elements and their relations. The main entity types that make
up an object-oriented system give the reader enough
information: package, class, interface, attribute and method,
and the relations between them, such as inheritance, access and
invocation. The OO source code model shows structural source
code entities such as packages, classes and methods. In
addition, this model explicitly represents information such
as: a class inherits from another class (i.e., inheritance), a
method accesses attributes (i.e., access) and a method invokes
other methods (i.e., invocation). These abstractions are
needed for reengineering tasks such as dependency analysis
[40].
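As a rough illustration of such a model, the sketch below encodes the entities and relations as plain Python records. It is not the Vsound or FAMIX implementation: the class and field names are hypothetical, chosen only to echo the relation names in the text, and the sample instances borrow names from the paper's drawing shapes example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Attribute:
    name: str
    owner: str  # name of the class the attribute belongs to


@dataclass(frozen=True)
class Method:
    name: str
    owner: str  # name of the class the method belongs to


@dataclass(frozen=True)
class Inheritance:
    subclass: str
    superclass: str


@dataclass(frozen=True)
class Invocation:
    invoked_by: Method  # the calling method
    candidate: Method   # the method being called


@dataclass(frozen=True)
class Access:
    accessed_in: Method  # the method that uses the attribute
    accesses: Attribute


@dataclass
class Model:
    inheritances: List[Inheritance] = field(default_factory=list)
    invocations: List[Invocation] = field(default_factory=list)
    accesses: List[Access] = field(default_factory=list)


# Sample instances; names come from the drawing shapes example.
draw = Method("draw", "MyShape")
paint = Method("paintComponent", "PaintJPanel")
dragged = Method("paintJPanelMouseDragged", "PaintJPanel")
current_shape = Attribute("currentShape", "DrawingShapes")

model = Model(
    inheritances=[Inheritance("MyLine", "MyShape")],
    invocations=[Invocation(invoked_by=paint, candidate=draw)],
    accesses=[Access(accessed_in=dragged, accesses=current_shape)],
)
print(model.invocations[0])
```

Keeping each relation as its own record type, rather than as a field on Class or Method, mirrors the idea that dependencies are represented explicitly in the model.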



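[Figure 2 (class diagram). Entities: Class, Method and
Attribute. Associations: InheritanceDefinition (between
classes, with the role superclass), Invocation (between
methods, with the roles invokedBy and candidates) and Access
(between a method and an attribute, with the role accessedIn).
Method and Attribute are each linked to Class through
belongsToClass.]

Figure 2. The source code meta-model [39].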

D. An Illustrative Example 

As an illustrative example, this paper considers the drawing
shapes software 3 (cf. Figure 3). This software allows a user
to draw three different kinds of shapes (lines, rectangles and
ovals) and to choose the color of the drawn shape. This
example is used to better explain some parts of this paper.
However, the Vsound approach uses only the source code of the
software as input to the documentation process and thus does
not know the code dependencies or software metrics in advance.

Drawing shapes software
  src
    Drawing.Shapes.coreElements
      MyLine.java
      MyOval.java
      MyRectangle.java
    Drawing.Shapes.coreFrame
      DrawingShapes.java
      MyShape.java
      PaintJPanel.java
  JRE System Library [JavaSE-1.7]

Figure 3. The drawing shapes software.

IV. The Object-oriented Software Documentation Process Step By Step

This section describes the OO software documentation
process step by step. Vsound identifies the software
documentation in four steps, as detailed in the following.

A. Extracting the Source Code of Software via Static Code
Analysis

Static code analysis is the analysis of computer software
that is performed without actually executing programs built
from that software [4], whereas analyzing the actions
performed by a program while it is being executed is called
dynamic analysis [5]. The Vsound approach accepts the source
code of the software as input. It then generates an XML file
based on static code analysis 4. The mined XML file contains
structural information about OO source code elements (e.g.,
the draw method is inside the MyLine class). It also contains
structural dependencies between source code elements (e.g.,
inheritance, attribute access and method invocation). The
Eclipse Java Development Tools (JDT) and the Eclipse
Abstract Syntax Tree (AST) can be used to access, modify and
read the elements of a Java program [37]. ASTs are broadly
used in several fields of software engineering as a
representation of source code [25]. The extracted XML file
contains all the information needed to document the software
by visualizing its code as a set of graphs. Figure 4 shows the
XML file mined from the source code of the drawing shapes
software.
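The static extraction step of Section IV.A can be imitated on a small scale. Vsound itself works on Java through the Eclipse JDT AST; the sketch below is only an analogue that uses Python's standard `ast` module on a hypothetical Python stand-in for the drawing shapes code. The XML element names (`Classes`, `SuperClass`, `MethodInvocation`, `AttributeAccess`) are assumptions loosely modelled on Figure 4, not the Vsound schema.

```python
import ast
import xml.etree.ElementTree as ET

# Hypothetical input: a tiny Python stand-in for the Java example.
SOURCE = '''
class MyShape:
    def draw(self, g):
        pass

class MyLine(MyShape):
    def draw(self, g):
        g.draw_line(self.x1, self.y1, self.x2, self.y2)
'''


def to_xml(source: str) -> ET.Element:
    """Walk the AST without executing the code and record classes,
    superclasses, methods, method invocations and attribute accesses."""
    root = ET.Element("Classes")
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        cls = ET.SubElement(root, "Class", ClassName=node.name)
        for base in node.bases:
            if isinstance(base, ast.Name):
                ET.SubElement(cls, "SuperClass", Name=base.id)
        for item in node.body:
            if not isinstance(item, ast.FunctionDef):
                continue
            method = ET.SubElement(cls, "Method", MethodName=item.name)
            call_targets = set()
            # First pass: calls of the form receiver.method(...).
            for sub in ast.walk(item):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Attribute):
                    call_targets.add(sub.func)
                    ET.SubElement(method, "MethodInvocation", Name=sub.func.attr)
            # Second pass: remaining attribute nodes are plain accesses.
            for sub in ast.walk(item):
                if isinstance(sub, ast.Attribute) and sub not in call_targets:
                    ET.SubElement(method, "AttributeAccess", Name=sub.attr)
    return root


xml_root = to_xml(SOURCE)
print(ET.tostring(xml_root, encoding="unicode"))
```

The code is parsed but never run, which is the defining property of static analysis described above.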



<Project ProjectName="Drawing shapes software" LinesOfCode="1500">
  <Packages>
    <Package PackageName="Drawing">...</Package>
    <Package PackageName="Drawing.Shapes">...</Package>
    <Package PackageName="Drawing.Shapes.coreElements">
      <Classes>
        <Class ClassName="MyLine" classAccessLevel="public">
          <SuperInterfaces/>
          <Attributes/>
          <Methods>
            <Method MethodName="MyLine" MethodAccessLevel="public">
              <Parameters NumberOfParameters="5">...</Parameters>
              <LocalVariables/>
              <AttributeAccesses>...</AttributeAccesses>
              <MethodInvocations/>
              <MethodExceptions/>
            </Method>
            <Method MethodName="draw" MethodAccessLevel="public">
              <Parameters NumberOfParameters="1">...</Parameters>
              <LocalVariables/>
              <AttributeAccesses>...</AttributeAccesses>
              <MethodInvocations>...</MethodInvocations>
              <MethodExceptions/>
            </Method>
          </Methods>
        </Class>
        <Class ClassName="MyOval">...</Class>
        <Class ClassName="MyRectangle">...</Class>
      </Classes>
    </Package>
    <Package PackageName="Drawing.Shapes.coreFrame">...</Package>
  </Packages>
</Project>

Figure 4. The extracted XML file of drawing shapes software.
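A file shaped like Figure 4 can be read back with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the figure; the helper name `methods_of` is hypothetical, and the elided children of the figure are simply left out rather than guessed.

```python
import xml.etree.ElementTree as ET

# Abridged from Figure 4; elided children ("...") are left out.
XML = """
<Project ProjectName="Drawing shapes software" LinesOfCode="1500">
  <Packages>
    <Package PackageName="Drawing.Shapes.coreElements">
      <Classes>
        <Class ClassName="MyLine">
          <Methods>
            <Method MethodName="MyLine" MethodAccessLevel="public">
              <Parameters NumberOfParameters="5"/>
            </Method>
            <Method MethodName="draw" MethodAccessLevel="public">
              <Parameters NumberOfParameters="1"/>
            </Method>
          </Methods>
        </Class>
      </Classes>
    </Package>
  </Packages>
</Project>
"""

project = ET.fromstring(XML.strip())


def methods_of(project_elem, class_name):
    """Return (method name, parameter count) pairs for one class."""
    result = []
    for cls in project_elem.iter("Class"):
        if cls.get("ClassName") != class_name:
            continue
        for method in cls.iter("Method"):
            params = method.find("Parameters")
            result.append(
                (method.get("MethodName"), int(params.get("NumberOfParameters")))
            )
    return result


print(project.get("ProjectName"), project.get("LinesOfCode"))
print(methods_of(project, "MyLine"))
```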

B. Identifying the Package Document 

The Vsound approach extracts several documents based on the
software source code. These documents cover all granularity
levels of the source code. Because it is impossible to gather
all the information needed to understand existing software in
one graph, Vsound provides one document (i.e., graph) for
every granularity level (i.e., package, class, attribute,
method). The package document aims to provide general
information about the software. It contains information about
software packages and quantitative information about the
software (cf. Figure 5).

3 Source code: http://code.google.com/p/drawing-shapes-software/

4 Source code: https://code.google.com/p/abstract-syntax-tree-vsound/

Figure 5 shows the package document mined from the drawing
shapes software. This document provides information about
software packages. For example, the graph in Figure 5 shows
that the drawing shapes software consists of two packages,
each containing three classes. In addition, this document
provides software metrics such as Lines of Code (LoC), Number
of Packages (NoP), Number of Classes (NoC), Number of
Attributes (NoA) and Number of Methods (NoM). For example,
the graph in Figure 5 shows that the drawing shapes software
consists of 29 methods and 14 attributes.

The package document is very useful for understanding the
software source code. Its goal is to give a general view of
the software system and to present the size of the software
(i.e., large, medium or small system). The Vsound approach
applies to software systems of different sizes, and handling
systems of different complexity levels shows its scalability.
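The package-level metrics are simple counts over the extracted structure. As a minimal sketch, assuming the package-to-class mapping reported for the drawing shapes software, NoP and NoC can be derived as follows; the function name and dictionary layout are hypothetical, and NoA and NoM are omitted because per-class values are not all given in the text.

```python
# Package -> classes, as reported for the drawing shapes software.
PACKAGES = {
    "Drawing.Shapes.coreElements": ["MyLine", "MyOval", "MyRectangle"],
    "Drawing.Shapes.coreFrame": ["DrawingShapes", "MyShape", "PaintJPanel"],
}


def package_document(packages):
    """Summarise a package -> classes mapping as size metrics."""
    classes_per_package = {
        name: len(classes) for name, classes in packages.items()
    }
    return {
        "NoP": len(packages),
        "NoC": sum(classes_per_package.values()),
        "classes_per_package": classes_per_package,
    }


doc = package_document(PACKAGES)
print(doc["NoP"], doc["NoC"], doc["classes_per_package"])
```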



Drawing shapes software
  Drawing.Shapes.coreElements (Number of Classes = 3)
    MyLine
    MyOval
    MyRectangle
  Drawing.Shapes.coreFrame (Number of Classes = 3)
    DrawingShapes
    MyShape
    PaintJPanel

  Lines of Code = 1500
  Number of Packages = 2
  Number of Classes = 6
  Number of Attributes = 14
  Number of Methods = 29

Figure 5. The package document mined from drawing shapes software.

C. Identifying the Class Document 

The class document displays information about software
classes. This information is very helpful for understanding
the software. The Vsound approach identifies three documents
that belong to the class document category. The first is the
class information document (cf. Figure 6). This document shows
the number of classes per package. It also presents
information about each class in the package: the class name,
the superclass name, whether the class is an interface, the
super interface name, the number of attributes and the number
of methods. As an example, the class MyShape in Figure 6 has
five attributes and 12 methods.



DrawingShapes
  Superclass: JFrame
  IsInterface: FALSE
  Number of Attributes = [value missing in source]
  Number of Methods = 5

MyShape
  IsInterface: FALSE
  Number of Attributes = 5
  Number of Methods = 12

PaintJPanel
  Superclass: JPanel
  IsInterface: FALSE
  Number of Attributes = [value missing in source]
  Number of Methods = 6

Figure 6. Part of the class information document mined from drawing
shapes software.

The second is the class dependency document (cf. Figure 7).
This document shows the main relations between software
classes (i.e., the inheritance relation). As an example, the
classes MyLine, MyOval and MyRectangle in Figure 7 have a
superclass called MyShape.
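The inheritance relation behind the class dependency document can be kept as a list of (subclass, superclass) pairs and inverted to answer questions such as "which classes specialize MyShape?". The sketch below assumes the pairs reported in Figures 6 and 7; the function name is hypothetical.

```python
from collections import defaultdict

# (subclass, superclass) pairs as reported in Figures 6 and 7.
INHERITANCE = [
    ("MyLine", "MyShape"),
    ("MyOval", "MyShape"),
    ("MyRectangle", "MyShape"),
    ("DrawingShapes", "JFrame"),
    ("PaintJPanel", "JPanel"),
]


def subclasses_by_superclass(pairs):
    """Invert (subclass, superclass) pairs into superclass -> subclasses."""
    index = defaultdict(list)
    for sub, sup in pairs:
        index[sup].append(sub)
    return dict(index)


index = subclasses_by_superclass(INHERITANCE)
print(index["MyShape"])
```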



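[Figure 7 (graph). Nodes: the project Drawing_Shapes_Software;
the package Drawing_Shapes_coreElements with the classes
MyRectangle, MyOval and MyLine; the package
Drawing_Shapes_coreFrame with the classes PaintJPanel, MyShape
and DrawingShapes; and the library classes JPanel and JFrame.]

Figure 7. The class dependency document mined from drawing shapes
software.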



Legend: each attribute row shows the attribute name and its data
type; each method row shows the method name and its returned data
type.

PaintJPanel
  Attributes:
    serialVersionUID       long
    shapesArraylist        ArrayList
    currentShape           MyShape
  Methods:
    PaintJPanel            void
    setCurrentShapeType    void
    setCurrentColor        void

Figure 8. Part of the class content document mined from drawing shapes
software.

The third is the class content document (cf. Figure 8). This
document shows the main content of each class. It also shows
the size of the class, where the height of the class in the
graph can be considered an indicator of class size. Vsound
considers the main elements in the classes. The class content
document includes each attribute name and its data type, in
addition to each method name and its returned data type. As an
example, the class PaintJPanel contains the currentShape
attribute of type MyShape. It also contains the
setCurrentShapeType method.



D. Identifying the Method Document 

The method document shows information about software
methods. This information is very useful for understanding
the software system. The Vsound approach identifies three
documents that belong to the method document category. The
first is the method information document (cf. Figure 9). It
contains the following information: the name of the method,
the returned data type, whether the method is static, the
number of parameters and the parameter list (i.e., name, data
type and order). As an example, the method MyRectangle in
Figure 9 has 5 parameters.
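One way to picture a method information record is as a small data structure that can be rendered in the layout of Figure 9. The sketch below is an assumption about how such a record might be modelled, not the Vsound data model; the class name `MethodInfo` and its fields are hypothetical, and the parameter order follows the listing in the figure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MethodInfo:
    name: str
    returned_type: str
    is_static: bool
    parameters: List[Tuple[str, str]]  # (name, data type), in order

    @property
    def number_of_parameters(self) -> int:
        return len(self.parameters)

    def render(self) -> str:
        """Format the record using the field labels of Figure 9."""
        lines = [
            f"Method Name: {self.name}",
            f"Returned Data Type: {self.returned_type}",
            f"Is Static Method: {str(self.is_static).lower()}",
            f"Number of Parameters: {self.number_of_parameters}",
        ]
        lines += [f"Parameter [Name,Type]: [{n} {t}]" for n, t in self.parameters]
        return "\n".join(lines)


# Values as they appear in Figure 9 for the MyRectangle method.
my_rectangle = MethodInfo(
    name="MyRectangle",
    returned_type="void",
    is_static=False,
    parameters=[
        ("firstX", "int"),
        ("firstY", "int"),
        ("secondX", "int"),
        ("secondY", "int"),
        ("shapeColor", "Color"),
    ],
)
print(my_rectangle.render())
```

Deriving the parameter count from the list, rather than storing it separately, keeps the two values from drifting apart.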

The second is the method content document (cf. Figure 10).
This document shows the main content of each method. It also
shows the size of the method, where the height of the method
in the graph can be considered an indicator of method size.
Vsound considers the main elements in the method body. The
method content document includes each local variable name and
its data type, each accessed attribute name and its type, and
each invoked method name and its declaring class. As an
example, the method PaintJPanel in Figure 10 contains
addMouseListener, which is a method invocation.



MyRectangle

  Method Name: MyRectangle
  Returned Data Type: void
  Is Static Method: false
  Number of Parameters: 5
  Parameter [Name,Type]: [firstX int]
  Parameter [Name,Type]: [firstY int]
  Parameter [Name,Type]: [secondX int]
  Parameter [Name,Type]: [secondY int]
  Parameter [Name,Type]: [shapeColor Color]

  Method Name: draw
  Returned Data Type: void
  Is Static Method: false
  Number of Parameters: 1
  Parameter [Name,Type]: [g Graphics]

Figure 9. Part of the method information document mined from drawing
shapes software.

The third is the method dependency document (cf. Figure 11).
This document shows the main relations between software
methods (i.e., method invocation and attribute access). As an
example, the method draw, which is declared in the MyShape
class, is invoked by the paintComponent method in the
PaintJPanel class. Also, the currentShape attribute, which is
declared in the DrawingShapes class, is accessed in the
paintJPanelMouseDragged method, which is declared in the
PaintJPanel class.

Figure 10. Part of the method content document mined from drawing shapes
software.

Figure 11. Part of the method dependency document mined from drawing
shapes software.



V. Experimentation 

This section presents the experiment conducted to validate
the Vsound approach. First, it presents the ArgoUML case
study. Next, it presents the evaluation metrics. Then, it
details the architecture and functioning of the Vsound
prototype tool. Finally, it presents the software
documentation results and the threats to the validity of the
Vsound approach.

A. Case Study

In addition to the toy drawing shapes example used in this
paper, the Vsound approach has been tested on another software
system called ArgoUML. ArgoUML is a widely used open source
tool for UML modeling. ArgoUML supports the following UML 1.4
diagram types: class diagram, state chart diagram, activity
diagram, use case diagram, collaboration diagram, deployment
diagram and sequence diagram. The advantage of using ArgoUML
as a case study is that it is well documented. In this
evaluation, the results are based on the source code of the
software, which is freely available for download from the case
study website 5. ArgoUML runs on any Java platform and is
available in ten languages. Table I characterizes ArgoUML by
the metrics LoC, NoP, NoC, NoA and NoM.

To build each document (i.e., graph), the Vsound_GUI
component of Vsound uses an external library called Graphviz 7
to produce SVG files containing the documentation of each
software artefact (i.e., package, class and method). SVG
(Scalable Vector Graphics) is an XML-based file format for
describing vector graphics [28].
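Graphviz reads graphs written in DOT, a plain-text graph description language. As a minimal sketch of how a class dependency graph could be serialized before rendering, the function below writes inheritance pairs as a DOT digraph. It is not the Vsound DOT_File_Builder; the function name, graph name and layout attributes are assumptions chosen for illustration.

```python
def to_dot(graph_name, edges):
    """Serialise (subclass, superclass) pairs as a DOT digraph."""
    lines = [
        f"digraph {graph_name} {{",
        "  rankdir=BT;",          # draw superclasses above subclasses
        "  node [shape=box];",
    ]
    for sub, sup in edges:
        lines.append(f'  "{sub}" -> "{sup}";')
    lines.append("}")
    return "\n".join(lines)


# Inheritance pairs from the drawing shapes example.
EDGES = [
    ("MyLine", "MyShape"),
    ("MyOval", "MyShape"),
    ("MyRectangle", "MyShape"),
]
dot_text = to_dot("class_dependency", EDGES)
print(dot_text)
```

Assuming Graphviz is installed, the printed text could be saved as classes.dot and rendered to SVG with `dot -Tsvg classes.dot -o classes.svg`.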



TABLE I. Size metrics for ArgoUML software system.

Software   LoC       NoP   NoC     NoA     NoM
ArgoUML    120,348   81    1,666   3,977   14,904



B. Evaluation Measures 

Precision and recall are used to evaluate the Vsound
approach. In this paper, precision is the ratio of correctly
retrieved links to the total number of retrieved links. Recall
is the ratio of correctly retrieved links to the total number
of relevant links [36]. In this work, a link is either a
source code element (package, class, attribute, method or
local variable) or a dependency (inheritance, method
invocation or attribute access). Both measures have values in
[0, 1]. If precision is equal to one, every retrieved link is
relevant; however, some relevant links might not be retrieved.
If recall is equal to one, all relevant links are retrieved;
nevertheless, some retrieved links might not be relevant. For
example, consider a software system that contains 95 relations
(i.e., inheritance relations). After applying the proposed
approach to this software, 90 relations are identified and all
of them are correct (i.e., all retrieved links are relevant),
while five relations are missing. In this case, precision is
equal to 1 (i.e., 90/90 = 1) and recall is approximately 0.95
(i.e., 90/95 ≈ 0.947).
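The two measures can be computed directly from the sets of retrieved and relevant links. The sketch below reproduces the worked example of 95 relevant relations with 90 retrieved, all of them correct; the link identifiers and the function name are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of links."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical link identifiers: 95 relevant, 90 of them retrieved.
relevant_links = {f"inheritance_{i}" for i in range(95)}
retrieved_links = {f"inheritance_{i}" for i in range(90)}

p, r = precision_recall(retrieved_links, relevant_links)
print(p, round(r, 3))
```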

C. A Simplified Structural View of the Architecture of 
Vsound 

The prototype tool 6 developed for the Vsound approach
implements the proposed software documentation process.
Figure 12 provides an overview of the Vsound tool
architecture. The tool receives the software source code as
input and produces a set of documents as output. The
XML_Generator module of Vsound produces an XML document that
represents the software source code elements and the
dependencies between those elements. The tool then generates
the software documents by using the DOT_File_Builder
component. These documents are serialized as DOT files. DOT is
a plain text graph description language. Starting from these
DOT files, the tool builds the software documents (i.e.,
graphs).


5 ArgoUML: http://argouml.tigris.org/

6 Vsound source code: https://code.google.com/p/fidd/issues/detail?id=2

Figure 12. A simplified structural view of the architecture of Vsound.

D. Result

The Vsound approach has been tested on the ArgoUML case
study and obtained promising results in this preliminary
evaluation. The approach extracted a collection of documents
from the source code of ArgoUML. The first is the package
document, which gives general information about ArgoUML, such
as the number of packages, the number of classes per package
and quantitative information like LoC and NoM. The second is
the class document (i.e., the class information document, the
class dependency document and the class content document),
which provides useful information about software classes. The
third is the method document (i.e., the method information
document, the method content document and the method
dependency document), which shows meaningful information
about the software methods. Results show that precision
appears to be high for all documents mined from ArgoUML. This
means that the documents mined and grouped as software
documentation are relevant. Recall is also quite high. This
means that most of the source code elements and dependencies
that compose the software documentation are mined. These
results are attributed to the way the Vsound approach
identifies software documentation.



7 http://www.graphviz.org/

E. Threats to Validity

The Vsound approach considers only Java software systems.
This is a threat to the validity of the prototype, since the
Vsound implementation can deal only with software systems
developed in the Java language. Vsound assumes that source
code elements and the dependencies between those elements can
be determined statically, as is the case for ArgoUML, which
was used in the Vsound evaluation. Nevertheless, there exist
systems that behave differently depending on runtime
parameters. For such systems, Vsound needs to be extended
with dynamic analysis techniques. The Vsound approach assumes
that software documentation can be expressed graphically as a
set of documents based on the source code. In some cases,
software systems need to be documented by describing their
functionalities. This means that Vsound may not be reliable
(or should be complemented with other techniques) in all cases
to identify software documentation. Documenting software
using only the names of source code elements and dependencies
in its implementation is not always reliable. Thus, it is
important to use the comments (i.e., line and block comments)
in the source code in order to enhance the documentation
process.

VI. Related Work 

This section presents the work related to the Vsound
contribution. It also provides a concise overview of the
different approaches and motivates the need for the Vsound
approach.

Wettel et al. [11] proposed an approach that visualizes
object-oriented software as a city in order to solve the
navigation problem. CodeCity is a visualization tool that
represents the software with a city metaphor. The classes are
represented as buildings and the packages as districts.
Moreover, some of the visual properties (i.e., width, height,
position and color) of the city artifacts carry information
about the software element they represent (e.g., the height of
a building represents the number of methods). Their approach
does not consider dependencies between source code elements.

Hammad et al. [32] used software visualization techniques
to visualize class coupling based on static analysis of the
source code. Other works, such as [33] and [34], applied
software visualization to model the dynamic behavior of the
software by instrumenting the source code in order to monitor
program executions. Moreover, visualization techniques can be
applied to software documentation to make it easier to use and
more helpful. For example, the work in [35] proposed a
visualization framework to visualize bug reports that are
saved in software repositories. The proposed framework
visualized the status changes of selected bugs, as well as
bug-developer relations, by using different shapes and colors.

Al-msie'deen et al. [6] [9] [29] [30] proposed an approach
called REVPLINE to identify and document features from the
object-oriented source code of a collection of software
product variants. The authors presented a new way to document
the mined feature implementations in [19]. For each feature
implementation, the proposed approach gives as output a name
and a description based on the use-case name and description
[7] [8]. The REVPLINE approach aims to document the features
extracted from a collection of software variants, while
Vsound aims to document OO software as a set of graphs.
Hammad et al. [31] presented an approach that focuses on
analyzing code changes to automatically detect any violation
of OO constraints, without using graphs as in the Vsound
approach.

Graham et al. [26] used a solar system metaphor to
represent the software source code. The main code entities
(i.e., packages and classes) are represented as planets, with
software metrics encoded in the planets' size and color. This
visualization represents the software as a virtual galaxy
consisting of many solar systems. Each solar system represents
a package. The central star of each solar system represents
the package itself, while the planets in orbit around it
represent classes within the package. Blue planets represent
classes, while light blue planets represent interfaces. Solar
systems are shown as a circular formation. The solar system
metaphor represents the packages and classes of the software
without considering methods and their bodies. In addition, it
focuses only on the inheritance relation between classes,
without considering the attribute access and method
invocation relations.

McBurney and McMillan [12] create summaries of Java
methods by using local information (i.e., keywords in the
method) and contextual information (i.e., keywords in the most
important referenced methods). The method summaries are
generated from elements in the method's signature and body.
The approach applies only to methods, without considering
other artifacts (e.g., package and class). It uses natural
language processing and information retrieval techniques. The
generated summaries are classified as abstract summaries.

Haiduc et al. [27] proposed an approach for summarizing
the software source code. The source code summaries consider
only the software methods and classes. The proposed approach
uses information retrieval techniques such as latent semantic
indexing and the vector space model. The approach extracts the
text from the source code of the software and converts it into
a corpus (i.e., source code corpus creation). Then, the
approach determines the most relevant terms for the documents
in the corpus and includes them in the summary (i.e.,
generating source code summaries using text retrieval). Their
approach does not exploit structural information from the
source code.

VII. Conclusions and Future Work 

This paper proposed a new approach for documenting and
understanding a software system. The novelty of this approach
is that it exploits code dependencies and quantitative
information in the source code to document OO software
efficiently by means of a set of documents. The Vsound
approach has been implemented and evaluated on several case
studies. The results of this evaluation showed that most of
the software systems were documented correctly. As future
work, the Vsound approach will be extended to automatically
generate descriptive comments that summarize all software
packages, classes and methods. It is also planned to use the
line and block comments in the documentation process.

Abstract: The rising awareness in organisations that
they should focus not only on physical assets but
also on other paramount assets such as knowledge
was the starting point for implementing knowledge
management and for creating knowledge management
strategies, codified or personalized, in order to
overcome issues and plan for future sustainability
and development. For a strategic consulting
company, the implementation of a knowledge
management strategy is both an internal asset and
an external means of obtaining value. However,
after implementing a knowledge management
strategy, some companies face the problem of low
revenue and low efficiency of the implemented
strategy. This is why the combination of
codification and personalization enables the
organisation to create, share, save and re-use
knowledge and to exchange it within the
organisation, using technology or informal
communication networks, in order to increase
innovation in the work environment and the
creation of new ideas and services. In addition,
the stored and codified knowledge, especially
knowledge about the organisation's customers, can
be considered the basic knowledge for taking
strategic decisions; through KM, the organisation
will be able to extract, understand and make good
use of this knowledge in order to strengthen its
relationship with its customers and allow them to
become the organisation's partners and advisors.
This means that the organisation's strategy must
be renewable and integrated with the overall
business model, and must focus on the
organisation's core competence and on the
management of customer knowledge and
interactions. This will positively influence and
improve organisational performance in any
organisation.

Key words: Knowledge management, Strategy, 
Consulting firms, organisational performance 

I-Introduction 

Nowadays, modern organisations are increasingly aware of
the importance of treating knowledge as a valuable asset
that can be managed. [1][2] Knowledge can be divided into
two types. The first is explicit knowledge, which is
documented and codified; here information technology plays
an important role in sharing and saving knowledge through
the organisation's systems and the latest technology. The
second is tacit knowledge, which comes from work experience
and intuition; we may not know that we have it or how to
explain it. This kind of knowledge can be shared through
interaction and a good communication environment between
workers in the organisation. [3][4] From this understanding
of knowledge, we can say that in a learning and smart
organisation, knowledge management is the combination of
people, processes, tools and technology used to acquire,
learn, create, renew and share knowledge, so that the
organisation makes the best use of this knowledge,
increases its organisational performance and ensures its
sustainability. [1][5][6][7][8]

In fact, consulting firms were among the first to invest in
knowledge management. There are different strategies for
introducing knowledge management within consulting
companies or any other organisation. The first is the
codification strategy, based on computers and databases
that store the knowledge; the second is the personalization
strategy, based on communication among people at work.
[9][10] Any organisation must have a clear vision of why it
needs to develop a knowledge management strategy, what the
goal of the implementation is, and how it can help the
organisation develop; the strategy should also be aligned
with the organisation's overall strategy. Here consulting
firms play an important role: they help the organisation
identify clearly the weaknesses and issues that need to be
resolved, and they help it keep and attract customers by
implementing a knowledge management strategy that is tied
to the organisation's global long-term vision and that
supplies a framework for resolving those issues and
increasing business and organisational performance.
[2][9][11]

II-Literature review 

Consulting firms are the organisations most concerned with
knowledge because it represents an asset, a resource, a
product or even a service that helps the firm create value
and benefit. That is why they need a strategy to manage
this knowledge and to be able to sell it as their business.
Sven, Dirk, Dietrich and Chiara analysed the correlation
between the knowledge management strategy and the business
model of the organisation. Management consulting firms
advise organisations on strategy, operations or information
technology, using knowledge to add value to the
organisation and to improve its performance. Consulting
firms commonly follow one of two business models. In the
first, the firm creates highly customised solutions for
significant and unique issues by providing original,
analytical advice based on experience and tacit knowledge;
this business model focuses on preserving a high profit for
the firm. In the second, the firm provides highly
standardised products and services because it deals
repeatedly with the same issues; it reuses its existing
modules and continues to create new modules and pieces in
order to generate large revenues. [12]

On the other hand, in consulting firms the knowledge
management strategy follows specific goals, techniques and
technology, and the generation, distribution and
maintenance of knowledge in each process is controlled
either centrally, which means the knowledge is codified,
documented and stored in databases, or decentrally, which
means the knowledge is personalized and attached to the
person who acquired it. The authors consider that firms
with a standardised business model should focus on the
central knowledge management strategy, whereas consulting
firms that provide customised solutions should focus on the
decentral knowledge management strategy, because in this
case codified knowledge has limited value for the
organisation and the firm needs tacit knowledge and
experience to innovate solutions. This analysis was
supported by studying the business model and knowledge
management strategy of leading companies, each in a
specific sector, such as McKinsey, Accenture,
PricewaterhouseCoopers and Prognos AG. [12]

Another study analyses the two strategies, codification and
personalization, from a marketing perspective in order to
optimise the efficiency of knowledge reuse. One issue is
that many organisations suffer from depressed returns on
their investment in knowledge management. To significantly
increase the efficiency of transferring knowledge between
consumers and producers, we should first recognise that the
inefficiency of knowledge transfer is due to the diverse
priorities and agendas of producers and consumers during
knowledge exchange and sharing. After developing a model to
help maximise the efficiency of knowledge reuse and
transfer, the authors found that the two strategies,
codification and personalization, should be combined to
enhance the efficiency of knowledge reuse. [13]
From another point of view, Zhu Yu, Wang Yan-fei and Lan
Hai-lin carried out a study with the participation of 223
enterprises in China to identify the relation between the
knowledge management strategy, the core competences and the
organisational performance. [14]

The study found, firstly, that for an organisation's future
development there should be coordination between the
knowledge management strategy and the core competence of
the organisation. This is because the core competence is
the intermediary through which the knowledge management
strategy affects organisational performance, so the
organisation should pay attention to developing its core
competences as well as its strategy to ensure long-term
development and sustainability. Secondly, the research
demonstrates that the greatest impact on human resource
competence and efficiency comes from the knowledge
management strategy and the core competence of the
organisation. In addition, a knowledge sharing culture has
to be developed in any organisation. By implementing a
codified knowledge management strategy, the organisation
builds a sharing mechanism through the capture, storage and
reuse of knowledge. This starts with educating employees
and raising their awareness so that they better understand
what the sharing culture is and how it benefits both the
employees and the organisation.

To summarise, the study found that the two knowledge
management strategies have different impacts on the
organisation: the personalization strategy has a positive
impact on the core competence, while the codification
strategy has a positive impact on both the core competence
and the organisational performance. [14]

We cannot forget the significance of technology in the
improvement and development of any organisation. From this
standpoint, Barney, Shan and Ray analysed a successful case
in Singapore to see how, specifically, web technologies can
improve organisational performance depending on the global
organisational environment. The authors found that web
technologies can play an important role in supporting the
organisation's business strategy and optimising its
performance. Furthermore, for web technologies to have an
effective influence, the organisation should rely not on
the technology alone but on the complex fit among the
strategy, the technology and the external environment. When
the organisation's environment is in equilibrium, web
technology improves organisational performance by
facilitating the realisation of competitive features
through three different mechanisms, "the logics of
positioning, leverage, and opportunity". Conversely, when
the organisation's environment is in a revolutionary state,
"Web technologies can give rise to performance gains by
supporting the attainment of legitimacy through two
distinct mechanisms: the logics of optimality and social
congruence". [15]
In the same context, another study has shown how combining
technology and knowledge management can be a tool to
increase the performance and profit of the organisation,
namely the use of customer relationship management (CRM)
integrated with knowledge management. CRM, as an approach
based on strategies and technology, helps the organisation
improve its business relationship with customers by
collecting information about them through various points of
contact between the organisation and its customers, such as
social media, the organisation's website, email, call
centres and different marketing tools. To achieve this,
organisations use various software packages to store all
customer information in a single database, to record
customer interactions and to automate workflow processes.
However, the information stored in a large database cannot
help to retain customers or to increase production and
long-term profit if it is not well managed, organised,
connected and distributed. Here knowledge management can
extract meaningful knowledge from this information and
transform it into valuable information that can be analysed
to attract customers and improve the organisation's
business performance. The combination of KM and CRM helps
the organisation's experts use knowledge for, from and
about customers in order to attain the organisation's goals
and optimise its business and organisational performance.
[16][17]

IBM, one of the successful organisations, was traditionally
known as a company with profound experience in information
technology, and under its old business model it depended on
hardware sales for revenue. However, IBM recognised that
organisations had started to pay more attention to the
strategic value of information technology tools and
re-engineering projects, and that those projects should be
linked to the organisation's overall business strategy. To
keep up with and react to the business environment, IBM
created the IBM Global Services business unit. It then made
the largest acquisition in its history by acquiring
PricewaterhouseCoopers (PwC) Consulting, a consulting firm
that provides information technology services and IT
management consulting. Furthermore, IBM created a new unit,
IBM Business Consulting Services, to combine the Global
Services business unit and PwC Consulting. [18]

With a very large number of consultants in different
countries, IBM has become one of the largest consulting
services organisations. In addition, IBM consulting has
adopted intellectual capital management (ICM) in order to
formalise knowledge management across IBM's services and
industries and to give more attention to acquiring,
creating, sharing, using and transferring knowledge in
order to ensure the development of the organisation. [18]

IBM's knowledge management strategy included linking the
strategy to the intellectual capital of the company,
creating a culture based on knowledge, creating processes
and infrastructure that help to create and share knowledge,
using technology for sharing knowledge, and measuring the
effectiveness of intellectual asset sharing. One of IBM's
priorities was also to raise the capability of the



http://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 



(IJCSIS) International Journal of Computer Science and Information Security, 
Vol. 13, No. 5, May 2015 



consultant group to share its knowledge and to merge it
rapidly in response to client needs, by creating informal
networks as a way to maximise the freedom of internal teams
to act and share knowledge. Thus, knowledge management at
IBM was based on social networks, and what differentiates
IBM's knowledge management process is the use of social
network analysis (SNA) as a tool to analyse and check the
characteristics of the knowledge network and as a change
management initiative. [18]

Furthermore, IBM realised that the real value comes from
creating and sharing knowledge, and that the company's
valuable knowledge comes from the heads of its talented
employees. To ensure its sustainability and maintain a
reputable brand in consulting, IBM needs to merge the two
knowledge management strategies, the top-down centralised
and the bottom-up decentralised approaches, in order to
build trust in the work environment and to exchange talent
internally. This is the case at the McKinsey consulting
firm, where consultants nominate themselves for a specific
project and managers then bid for them; this exchange
mechanism helps the best employees to distinguish
themselves through their experience and knowledge. [18]
As mentioned by the authors Edna and Tuvya, IBM is known as
a company of ongoing strategic renewal. IBM's success over
the years has been based on a strategic conjunction of
innovation in business management and in technology. At the
level of technological innovation, IBM is known as one of
the companies with the largest research and development
organisation, whose success is due to the management of
employees and their knowledge. Thanks to this research and
development organisation, IBM introduced as business models
first personal computers, then mobile computing and the
ThinkPad; after this IBM made a major transformation by
focusing on providing consulting to clients. Thus, IBM has
transformed itself many times according to market changes
and needs. [19]

At the level of business management innovation, IBM had a
long-term vision that led to its success today as a company
that supplies services and solutions, hardware and
software, with profound knowledge of the different fields
it serves. IBM also fosters internal innovation by giving
importance to processes such as global brainstorming.
Innovation is drawn from the organisation's internal
communication network, which brings together a very large
number of customers, employees and business partners to
share new services and ideas, conduct debates and much
more. [19]

III-Literature findings

The general awareness among organisations in different
fields and sectors of the importance of treating knowledge
as a valuable asset gives rise to the important role of
strategic consulting firms. These firms themselves need
knowledge internally, as an asset and resource for creating
a product or service that brings value and benefit, and
externally, to sell it as their business by implementing
strategies to manage knowledge in organisations from
different fields.
There are two types of knowledge management strategy used
within and through consulting companies: codification and
personalization. The successful implementation of knowledge
management depends on combining these two strategies, which
means that knowledge should be saved and codified to
simplify its reuse and, at the same time, exchanged through
the personalized strategy. In addition, this combination
should be related to other components of the organisation,
such as the role of technology, the role of the business
model, the importance of core competences, the creation of
a sharing culture, the importance of intellectual capital
and its assessment, and rapid adaptation to market change.

As a result, combining and aligning the components
mentioned above with the knowledge management strategies
can significantly increase organisational and business
performance, ensure the organisation's development and
sustainability, and help it maintain a high level of
customer satisfaction, as in the successful case of IBM.
In the figure above we see how strategic consulting firms
use and implement knowledge management strategies in
alignment with the business model, by focusing on
innovation in the organisation and on the use of
technologies such as web technology and customer
relationship management systems. Together, these components
lead to an increase in organisational performance, both in
the consulting firm itself and in the other organisations
that consulting firms help to improve their situation and
resolve their issues.



IV-Conclusion 

In short, the knowledge management strategy helps the
organisation to analyse its needs and issues and to devise
solutions and plans for the development and sustainability
of the company. It does so by reaching a high level of
customer satisfaction and by increasing interactions with
customers, using technology and systems or direct
communication and interviews, which makes customers
partners in defining the organisation's future. This is the
goal of consulting firms when implementing a knowledge
management strategy in an organisation. The successful
implementation of a knowledge management strategy, in a
consulting firm or by a consulting firm for another
company, plays a major role, together with other components
such as the business model, in achieving a significant
increase in organisational performance.
Abstract: This paper presents an application of the Discrete
Fourier Transform (DFT) attack to the stream cipher Welch-Gong
(WG)-7. WG-7 is a lightweight, hardware-oriented stream cipher
that uses a word-oriented linear feedback shift register (LFSR)
and a nonlinear WG transformation that acts on the LFSR output
word. The cipher was designed by Yiyuan Luo, Qi Chai, Guang Gong
and Xuejia Lai, as a variant of the original WG cipher, to work
in resource-constrained environments. This paper aims at
recovering the keystream faster than the DFT attack complexity
predicted by the designers. The proposed DFT attack recovers the
keystream with a time complexity of at most $2^{22}$ and a
key-bit complexity of at most $2^{19}$ by exploiting an
annihilator in the structure of the cipher. The proposed attack
is more efficient than both the algebraic attack and the fast
algebraic attack on the cipher.

Index Terms — WG-7, Discrete Fourier Transform Attack, Key 
Recovery Attack, Discrete Fourier Spectra Attack. 

I. Introduction 

Ronjom and Helleseth [1] introduced a new attack on the filter
generator that recovers the initial state of a filtering
sequence generator. The attack is more efficient than the fast
algebraic attack (FAA) but needs more key bits than FAA. Ronjom
and Helleseth [2] extended the attack of [1] to filter
generators over finite fields $F_{2^n}$, consisting of a
primitive feedback polynomial over $F_{2^n}$ that generates a
sequence filtered through a nonlinear Boolean function. They
also presented results on attacking the WG cipher proposed by
Yasir et al. [3]. Ronjom et al. [4] made the linear subspace
attack generic by forming a system of linear equations over
$F_{2^n}$ instead of $F_2$, using filtered sequences with their
trace representation. Gong [5] showed a fast computation of the
DFT using the selective DFT algorithm, which simplifies the
selective DFT attack. Gong et al. [6] introduced a new variation
known as the fast selective DFT attack, which is strongly
related to the FAA and the algebraic attack; however, the fast
selective DFT attack is more efficient than FAA when the number
of captured consecutive bits of the cipher is less than the
linear complexity of the sequence. The DFT attack was further
improved by Wang et al. [7], who relax the requirement for
consecutive keystream bits and carry out the DFT operations in a
subfield rather than the extension field. Wang et al. [8]
applied the attack to a version of the Bluetooth encryption
algorithm E0. Wang et al. [9] also established that the DFT
attack can recover any keystream by using at most half of the
cipher period.

WG-7 [10] is a lightweight stream cipher whose design is mainly
inspired by the WG stream cipher [3]. WG-7 is a hardware-oriented
stream cipher that consists of a 23-stage word-oriented linear
feedback shift register (LFSR) over $F_{2^7}$ and a nonlinear WG
transformation. The cipher was designed to work in
resource-constrained environments. Its authors analysed the
design and concluded that the cipher is secure against the DFT
attack. However, this evaluation is not in-depth, especially
after the discovery of an annihilator by Muhammad Ali et al.
[11], which was found in order to reduce the complexity of the
algebraic attack on WG-7. The same annihilator is used in our
proposed attack to demonstrate the feasibility of a DFT attack
against WG-7. We employ the Fast Selective DFT attack [6], which
is also referred to in the literature as the Fast Discrete
Spectra Attack. The proposed attack reduces the online
complexity by placing the heavy computation in the offline
phase. We later show that the proposed DFT attack is more
efficient in terms of key-bit requirement and online computation
than the algebraic and fast algebraic attacks on WG-7.

The paper is organized as follows. Section II gives a brief 
description of the WG-7 Cipher, Section III gives a brief 
description of the Fast Selective DFT attack. Then section IV 
describes the proposed DFT attack. 

II. Description of WG-7 

WG-7 [10] is designed to generate up to $2^{24}$ bits of
keystream from an 80-bit key and an 81-bit initialization vector
(IV). The internal state of 161 bits $[s_1, \ldots, s_{161}]$ is
held in one LFSR of length 23 whose stages are 7-bit words. The
cipher consists of this 23-stage LFSR over $F_{2^7}$ and a
nonlinear WG transformation. It has the ideal two-level
autocorrelation property. The nonlinear transformation is
defined by equation (1). The cipher works by loading the secret
key and IV to initialize the internal state of the LFSR. The
LFSR is then clocked 46 times with its nonlinear update. After
this the cipher generates the required keystream.

$WG7(x) = \mathrm{Tr}(x^{3} + x^{9} + x^{21} + x^{57} + x^{87}), \quad x \in F_{2^7}.$ (1)
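
The trace form in equation (1) can be evaluated directly with a few lines of finite-field arithmetic. The following is a minimal sketch, not the designers' implementation. It assumes that $F_{2^7}$ is represented modulo the polynomial $x^7 + x + 1$, which is not stated in this section, and the names gf_mul, gf_pow, trace and wg7 are illustrative.

```python
# Sketch of the WG-7 transformation of equation (1).
# Assumption: GF(2^7) is built modulo x^7 + x + 1 (not given in this section).
MOD = 0b10000011  # x^7 + x + 1
N_BITS = 7


def gf_mul(a, b):
    """Multiply two elements of GF(2^7) stored as 7-bit integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << N_BITS):  # reduce when the degree reaches 7
            a ^= MOD
    return r


def gf_pow(a, e):
    """Raise a to the power e by square-and-multiply."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def trace(a):
    """Tr(a) = a + a^2 + a^4 + ... + a^(2^6); the result is 0 or 1."""
    t = 0
    for _ in range(N_BITS):
        t ^= a
        a = gf_mul(a, a)
    return t


def wg7(x):
    """WG7(x) = Tr(x^3 + x^9 + x^21 + x^57 + x^87)."""
    s = 0
    for e in (3, 9, 21, 57, 87):
        s ^= gf_pow(x, e)
    return trace(s)
```

Calling wg7 on each of the 128 field elements gives the truth table of the filter function under the assumed field representation; a different defining polynomial would permute that table.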

Yiyuan et al. [10] concluded that a DFT attack cannot be
launched against the stream cipher WG-7. Their calculations are
based on the complexity computations given by Ronjom et al. [2].
The complexity of the attack was found to be $2^{25.5}$
keystream bits after a precomputation of complexity
$O(2^{39.5})$. If the adversary obtains fewer than $2^{25}$ key
bits, the adversary has to guess $2^{23.5}$ unknown bits to
launch the DFT attack. The cipher designers concluded that the
best attack against WG-7 is exhaustive search, as the adversary
cannot obtain more than $2^{24}$ consecutive keystream bits.

III. Discrete Fourier Transform Attack 

Gong et al. [6] define two versions of the DFT attack. Choosing
between them requires the exact linear span of the cipher. If
the linear span is high, the selective DFT method cannot be
used. In this situation the fast selective DFT attack can be
used instead: it reduces the linear span by multiplying the
sequence by another sequence whose linear span is lower than
that of the original sequence. The adversary can launch the DFT
attack to recover the internal state, and the cipher is then
clocked backwards to recover the secret key. The fast selective
DFT attack is described in the following subsection.

A. Fast Selective DFT Attack 

The algorithm is a description of Algorithm 2 given by Gong et
al. in [6]. It recovers the factor $\beta$ when the number of
observed consecutive bits of a filter generator is less than the
linear complexity of the sequence. The fast selective method is
used to recover the initial state of the cipher. The positions
where the output sequence is 1 are selected, and their indices
are used to build the system of equations from the trace form of
the polynomial. The trace computation is also used to obtain
elements of the subfield $F_2$ from $F_{2^n}$. The system of
equations is solved by any known method to recover $\beta$.

1) Offline Computation: The offline computation includes the
selection of the annihilator function and its conversion to DFT
form (trace form) for the recovery of key bits in the online
phase. The major computation is carried out in this phase as a
one-time effort. First, a new sequence $g = \{g_t\}$ is selected
such that the product of the original sequence $u_t$ and the
annihilator sequence $g_t$ is 0. The new sequence also satisfies
the condition $H_g < l(u)$, where
$H_g = \{k \mid G_k \neq 0,\ k \text{ is a coset leader mod } N\}$
and $N$ is the complete period of the cipher. The $G_k$ are
computed using the method described by Gong et al. [4].
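
The set $H_g$ above is indexed by cyclotomic coset leaders modulo the period. As a small illustration, the sketch below enumerates the coset leaders modulo $2^n - 1$ for a small $n$; the function name is illustrative, and for the real cipher ($n = 161$) a direct enumeration like this is of course not feasible.

```python
def cyclotomic_coset_leaders(n):
    """Return the smallest element of every 2-cyclotomic coset mod 2^n - 1."""
    period = (1 << n) - 1
    seen = set()
    leaders = []
    for k in range(period):
        if k in seen:
            continue
        # k is the smallest unseen index, so it leads its coset
        leaders.append(k)
        j = k
        while j not in seen:
            seen.add(j)
            j = (2 * j) % period  # next element of the coset {k, 2k, 4k, ...}
    return leaders
```

For example, with n = 5 the cosets modulo 31 are led by 0, 1, 3, 5, 7, 11 and 15, so a DFT over that period needs only seven coefficients $G_k$, the rest following by conjugation.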

2) Online Computation: In the online computation, all $k$ are
replaced with the indices $t$ at which the keystream bit is
$u_t = 1$. The function $v_t$ is used to recover $\beta$ from
the equation
$v_t = \sum_{k \in H_g} \mathrm{Tr}(\beta^{k} G_k \alpha^{tk}) = 0$.
Thereafter the coefficient matrix is computed from the terms
$G_k \alpha^{tk}$.
The resulting system of independent equations has a unique
solution and can be solved by Gaussian elimination. To recover
the bits after the initialization stage, this system of
$l(g) - 1$ linear equations over $F_2$ in $l(g) - 1$ variables
has to be solved. If there is a $k \in H_g$ with
$\gcd(k, N) = 1$, then $\beta$ is extracted as
$\beta = (\beta^{k})^{k'}$, where $k' = k^{-1} \pmod{N}$;
otherwise, the values $\{\beta^{k} \mid k \in H_g\}$ are
returned. In this algorithm, the number of required consecutive
bits from $u_t$ is $l(g)$.
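
The linear-algebra step of the online phase is ordinary Gaussian elimination over $F_2$. A minimal sketch follows, assuming each equation is stored as an integer bitmask (bit j holds the coefficient of variable j); the function name and this representation are illustrative choices, not taken from [6].

```python
def solve_gf2(rows, rhs, n_vars):
    """Solve A x = b over GF(2); return x as a list of bits, or None
    if the system is inconsistent or does not have a unique solution."""
    # Augment each row: bit n_vars carries the right-hand side.
    aug = [r | (b << n_vars) for r, b in zip(rows, rhs)]
    pivots = []
    rank = 0
    for col in range(n_vars):
        pivot = next((i for i in range(rank, len(aug))
                      if (aug[i] >> col) & 1), None)
        if pivot is None:
            continue
        aug[rank], aug[pivot] = aug[pivot], aug[rank]
        # Clear this column in every other row (XOR is addition in GF(2)).
        for i in range(len(aug)):
            if i != rank and (aug[i] >> col) & 1:
                aug[i] ^= aug[rank]
        pivots.append(col)
        rank += 1
    if any(row == (1 << n_vars) for row in aug):
        return None  # a row reads 0 = 1
    if rank < n_vars:
        return None  # free variables remain
    x = [0] * n_vars
    for i, col in enumerate(pivots):
        x[col] = (aug[i] >> n_vars) & 1
    return x
```

As a usage example, the system x0 + x1 = 1, x1 + x2 = 1, x0 + x1 + x2 = 0 is passed as solve_gf2([0b011, 0b110, 0b111], [1, 1, 0], 3). Storing rows as integers makes each row operation a single XOR, which is the usual reason for this layout.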

3) Complexity Calculations: The complexity calculations are the
same as in [6]. The preprocessing phase is a one-time effort, as
in the fast algebraic attack. The complexity of the online phase
is calculated for the case where the product of the original
function sequence and the annihilator sequence is equal to zero.

IV. Proposed Fast Selective DFT Attack on WG-7 

This section describes our proposed attack, which reduces the
complexity of the DFT attack on WG-7. The algebraic normal form
(ANF) of the nonlinear transformation acting on the last 7 bits
of the cipher is given by Mohammad Ali et al. in [11]. The best
annihilator of the nonlinear transformation function g found by
Mohammad Ali et al. [11] is given as equation (3).



$$
\begin{aligned}
h(y_1,\ldots,y_7) ={} & 1 + y_1 + y_3 + y_1y_2y_3 + y_4 + y_1y_4 + y_2y_4 + y_1y_2y_4 + y_3y_4 \\
& + y_4y_5 + y_1y_4y_5 + y_3y_4y_5 + y_6 + y_1y_6 + y_2y_6 + y_1y_2y_6 + y_3y_6 + y_2y_3y_6 \\
& + y_2y_3y_7 + y_4y_7 + y_2y_4y_7 + y_3y_4y_7 + y_3y_5y_7 + y_4y_5y_7 + y_6y_7 \\
& + y_1y_6y_7 + y_2y_6y_7 + y_3y_6y_7 + y_1y_3y_4 + y_2y_3y_4 + y_7 \\
& + y_1y_6y_7 + y_2y_6y_7 + y_3y_7 + y_1y_3y_7 + y_1y_3y_5. \qquad (3)
\end{aligned}
$$
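
The defining property of an annihilator, that its product with the filter function vanishes on every input, can be checked exhaustively when the number of variables is small. The sketch below is illustrative only: is_annihilator, u_toy and g_toy are hypothetical names, and the two-variable toy functions are not the WG-7 functions of [11].

```python
from itertools import product


def is_annihilator(g, u, n):
    """True if g is not identically zero and g(x) * u(x) = 0
    for every x in GF(2)^n."""
    points = list(product((0, 1), repeat=n))
    nonzero = any(g(*x) for x in points)
    return nonzero and all((g(*x) & u(*x)) == 0 for x in points)


# Hypothetical toy functions (not from the paper):
# u = x0*x1 and g = 1 + x0, so g*u = x0*x1 + x0*x1 = 0.
def u_toy(x0, x1):
    return x0 & x1


def g_toy(x0, x1):
    return 1 ^ x0
```

With a 7-variable function such as $h$ in equation (3), the same check would run over only 128 input points, which is why annihilators of the WG-7 filter can be verified by brute force.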

Applying the Fast Selective DFT Attack proposed by Gong et al.
[6] directly to WG-7 cannot achieve a considerable reduction in
attack complexity because of the structure of the cipher. A
further limitation is that the attack requires consecutive
keystream bits, regardless of whether they are 1 or 0. The
word-oriented structure of the cipher also hinders the
computation of the DFT over the complete period ($2^{161} - 1$)
and of the primitive polynomial that defines this period. Our
method transforms the LFSR structure of WG-7 into an equivalent
polynomial form of degree 161 before the filter function is
applied. A primitive polynomial is derived that defines the
complete period of WG-7 before application of the filter
function, and this polynomial is used to calculate the DFT form
of the annihilator and of the original function. The method
employs the Fast Selective DFT Attack [6] and achieves a
significantly lower complexity than the DFT attack analysed by
Yiyuan et al. [10]. To illustrate the procedure, a simulated
cipher similar in structure to WG-7 is used. The method is
carried out in two phases, offline and online.

A. Offline computation 

In the offline phase, a polynomial form of the cipher's LFSR
structure is developed that generates all the elements of the
field $GF(2^{161})$. For this purpose, two outputs of the cipher
are taken simultaneously. The first output is generated without
the original filter function given as equation 1, and the second
with the annihilator given as equation 3. The first output, taken
from the LFSR stage of the cipher, generates $2^{161}$ bits. These
bits are used to calculate the primitive polynomial of the cipher
using the well-known Berlekamp-Massey (BM) algorithm. The
polynomial is primitive in $GF(2^n)$, where $n = 161$. The second
output of the cipher, using the annihilator from equation 3, is
generated and stored with the initial state set as
$(0, 0, 0, 0, 0, 0, \alpha^{128})$. The polynomial of $GF(2^n)$
calculated above is used to calculate the DFT coefficients (trace
form) from the annihilator's output. The selective DFT computation
method described by Gong et al. [4] is used to compute the DFT.
Cyclotomic coset leaders are used to represent the equivalent
trace form of the annihilator for launching the fast selective DFT
attack described by Gong et al. [6]. The major computation of the
attack is carried out in this precomputation phase, and it is a
one-time effort.
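
The role of the BM algorithm in this phase can be illustrated on a small scale. The following is a minimal sketch, not the authors' implementation: it runs Berlekamp-Massey over GF(2) on the output of a 10-stage LFSR whose feedback polynomial, x^10 + x^3 + 1, is an assumption chosen only for illustration. The function names `berlekamp_massey` and `lfsr_bits` are hypothetical.

```python
def berlekamp_massey(bits):
    """Return (L, c) for a binary sequence.

    L is the linear complexity and c[0..L] is the connection polynomial,
    i.e. c[0] = 1 and sum(c[j] * bits[i - j]) = 0 (mod 2) for all i >= L.
    """
    n = len(bits)
    c = [0] * (n + 1)  # current connection polynomial
    b = [0] * (n + 1)  # copy saved at the last length change
    c[0] = b[0] = 1
    L, m = 0, -1
    for i in range(n):
        # Discrepancy between the predicted and the actual bit.
        d = bits[i]
        for j in range(1, L + 1):
            d ^= c[j] & bits[i - j]
        if d:
            saved = c[:]
            shift = i - m
            for j in range(n + 1 - shift):
                c[j + shift] ^= b[j]
            if 2 * L <= i:
                L = i + 1 - L
                m = i
                b = saved
    return L, c[:L + 1]


def lfsr_bits(state, count):
    """Bits of the assumed LFSR s[i] = s[i-7] XOR s[i-10] (x^10 + x^3 + 1)."""
    bits = list(state)
    while len(bits) < count:
        bits.append(bits[-7] ^ bits[-10])
    return bits[:count]


if __name__ == "__main__":
    stream = lfsr_bits([1] + [0] * 9, 40)
    length, poly = berlekamp_massey(stream)
    print(length, poly)
```

For a sequence of linear complexity L, 2L bits are enough for the algorithm to settle on the minimal polynomial, so the 40 bits used here are more than sufficient for the 10-stage toy register.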

B. Online Computation 

The use of the annihilator significantly decreases the linear
span of the cipher, which in turn reduces the number of linearly
independent equations required. A system of equations is
generated from the positions where the captured keystream bits
are 1, in order to recover the initial state of the filter
generator (WG-7) by the method already described in Section III.
The system is solved to obtain the key bits and thereby recover
the initial state of the cipher. The complexity of the attack is
given in Table I.
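
Solving the resulting system amounts to linear algebra over GF(2). The sketch below shows Gauss-Jordan elimination on bit-packed rows. It is a generic illustration under the assumption that the equations have already been written as 0/1 coefficient rows, and `solve_gf2` is a hypothetical helper name, not part of the described attack.

```python
def solve_gf2(rows, rhs):
    """Solve A x = b over GF(2).

    rows: list of equal-length 0/1 coefficient lists, rhs: list of 0/1.
    Returns one solution (free variables set to 0) or None if inconsistent.
    """
    n_vars = len(rows[0])
    # Pack each equation into an integer: bit j is the coefficient of x_j,
    # bit n_vars is the right-hand side.
    aug = [
        sum(bit << j for j, bit in enumerate(row)) | (b << n_vars)
        for row, b in zip(rows, rhs)
    ]
    pivots = []
    rank = 0
    for col in range(n_vars):
        pivot = next(
            (i for i in range(rank, len(aug)) if (aug[i] >> col) & 1), None
        )
        if pivot is None:
            continue
        aug[rank], aug[pivot] = aug[pivot], aug[rank]
        # Clear this column in every other row (XOR is addition in GF(2)).
        for i in range(len(aug)):
            if i != rank and (aug[i] >> col) & 1:
                aug[i] ^= aug[rank]
        pivots.append(col)
        rank += 1
    # A remaining row reading "0 = 1" means the system has no solution.
    if any(row == (1 << n_vars) for row in aug[rank:]):
        return None
    x = [0] * n_vars
    for r, col in enumerate(pivots):
        x[col] = (aug[r] >> n_vars) & 1
    return x
```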

C. Application of DFT Attack on Simulated Filter Generator

A small filter generator is introduced to demonstrate the
proposed method. The relevant details are given in the following
subsections.

1) Specification of Filter Generator: The filter generator is
designed to generate up to $2^{10} - 1$ bits of keystream from a
10-bit key. The 10-bit internal state $[s_0, \dots, s_9]$ is
organized as one LFSR of length 2, each stage holding a 5-bit
word. The cipher consists of a 2-stage LFSR over $F(2^5)$ and a
non-linear transformation. The finite field is defined by the primitive



polynomial d(x) 



-x 6 + 1. The characteristic polynomial 



is primitive over $F(2^5)$ and is given by $f(x) = x^2 + x + \alpha^7$,
where $\alpha$ is a root of $d(x)$. The non-linear transformation is
defined by equation 4:



$$u(x_0, \dots, x_4) = x_0 + x_0x_1 + x_0x_2 + x_0x_1x_2 + x_3 + x_1x_3 + x_2x_3 + x_1x_2x_3 + x_4 + x_0x_4 + x_1x_4 + x_2x_4 + x_0x_2x_4 + x_1x_2x_4 + x_0x_3x_4. \tag{4}$$

2) Offline Mode Computation: In offline mode, two outputs of the
word-oriented LFSR are taken. The period of the cipher is
$2^{10} - 1 = 1023$, since $n = 5 \times 2 = 10$. One output is
taken without the output filter function $u(x)$ of equation 4 for
the complete period. The other is taken with the annihilator
given as equation 5. Both outputs are stored, with a storage
requirement of $2^{11}$.

$$g(x_0, \dots, x_4) = x_2 + x_0x_2 + x_1x_2 + x_0x_3 + x_1x_3 + x_1x_4 + x_3x_4. \tag{5}$$

The bits of each LFSR output word are XORed in order to convert
the word from $F_{2^5}$ to $F_2$. If the bits of the word are
represented as $[x_0, \dots, x_4]$, then the XOR is computed as
$x_0 + x_1 + x_2 + x_3 + x_4$. The resulting bit sequence is used
as input to the BM algorithm to generate the primitive polynomial
of the cipher, given as equation 6:



l(x) 



.10 



■x A + x 6 +x 2 + 1; 



(6) 



After this step, the trace form of the annihilator is generated
as per the algorithm specified in [5]. The result can be
re-verified by using the output of the cipher with the annihilator
and calculating the DFT for the cyclotomic coset leaders: [1, 3, 5,
7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39,
41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 69, 71, 73, 75,
77, 79, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109,
111, 115, 117, 119, 121, 123, 125, 127, 147, 149, 151, 155,
157, 159, 165, 167, 171, 173, 175, 179, 181, 183, 187, 189,
191, 205, 207, 213, 215, 219, 221, 223, 231, 235, 237, 239,
245, 247, 251, 253, 255, 341, 343, 347, 351, 363, 367, 375,
379, 383, 439, 447, 479, 495, 511]. The trace representation
of u is given as equation 7:



U(x) = Tr(a 462 x + a 858 * 3 + a 33 V + a 363 x 7 + a 59 V+ 
a 528 * 11 + a 66 x 13 + a 429 x 17 + a 69 V 9 + a 264 * 21 + a 858 * 25 
+ a 363 x 69 + a 52 V 3 ). (7) 



The final trace form of the annihilator is given as equation 8:

$$G(x) = \mathrm{Tr}(\alpha^{66}x + \alpha^{858}x^{3} + \alpha^{693}x^{5} + \alpha^{429}x^{9} + \alpha^{429}x^{17}), \quad x \in F_{2^{10}}. \tag{8}$$
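
The cyclotomic coset leaders used for the DFT computation can be regenerated mechanically. The sketch below is an illustration only: it assumes the usual definition (the coset of k modulo 2^n - 1 is {k, 2k, 4k, ...} and its leader is the smallest member), and the function name `cyclotomic_coset_leaders` is hypothetical.

```python
def cyclotomic_coset_leaders(n):
    """Leaders of the nonzero 2-cyclotomic cosets modulo 2**n - 1."""
    modulus = (1 << n) - 1
    seen = set()
    leaders = []
    for k in range(1, modulus):
        if k in seen:
            continue
        # k is the smallest value not yet covered, so it leads its coset.
        leaders.append(k)
        j = k
        while j not in seen:
            seen.add(j)
            j = (2 * j) % modulus
    return leaders


if __name__ == "__main__":
    leaders = cyclotomic_coset_leaders(10)
    print(len(leaders), leaders[:8], leaders[-1])
```

Every leader is odd, because an even member k of a coset always has the smaller member k/2 in the same coset.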

3) Online Mode Computation: Equation 8 is used to recover the
initial state, or bits, of the cipher. The bits captured by the
adversary are now used to perform the fast selective DFT attack.
The initial state of the LFSR can be represented as:

$$w_i = \mathrm{Tr}(\alpha^{i}\beta), \quad i = 0, 1, \dots, 9. \tag{9}$$

To find the initial state, $\beta$ needs to be computed, and the
captured keystream bits are used for this purpose. All bit
indices $i$ at which the keystream bit is 1 are used to form a
system of equations over $F_{2^{10}}$. Substituting
$x = \alpha^{i}\beta$ into equation 8 gives equation 10:

$$\mathrm{Tr}(\alpha^{66+i}\beta + \alpha^{858+3i}\beta^{3} + \alpha^{693+5i}\beta^{5} + \alpha^{429+9i}\beta^{9} + \alpha^{429+17i}\beta^{17}) = 0. \tag{10}$$

Every $\beta^{n}$ in equation 10 is replaced using equation 11,
for all $n$ representing cyclotomic cosets of $F_{2^{10}}$, to
form the system of equations 12:

$$\beta^{n} = x_0 + \alpha x_1 + \alpha^{2}x_2 + \alpha^{3}x_3 + \alpha^{4}x_4 + \alpha^{5}x_5 + \alpha^{6}x_6 + \alpha^{7}x_7 + \alpha^{8}x_8 + \alpha^{9}x_9. \tag{11}$$

$$\mathrm{Tr}\Big(\alpha^{66+i}\sum_{j=0}^{9}\alpha^{j}x_j\Big) + \mathrm{Tr}\Big(\alpha^{858+3i}\sum_{j=0}^{9}\alpha^{j}x_j\Big) + \mathrm{Tr}\Big(\alpha^{693+5i}\sum_{j=0}^{9}\alpha^{j}x_j\Big) + \mathrm{Tr}\Big(\alpha^{429+9i}\sum_{j=0}^{9}\alpha^{j}x_j\Big) + \mathrm{Tr}\Big(\alpha^{429+17i}\sum_{j=0}^{9}\alpha^{j}x_j\Big) = 0. \tag{12}$$

By the linearity of the trace function, the system of equations
is solved to extract the values of $[x_0, \dots, x_9]$, which are
finally substituted into equation 11 to obtain $\beta$ and thereby
recover the initial state.
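
The trace representation of the LFSR state and the linearity used above can be checked numerically on a small field. The sketch below is illustrative only: the defining polynomial x^10 + x^3 + 1 is an assumption (the cipher's own polynomial of equation 6 is not used), and `gf_mul`, `gf_pow`, `trace` and `state_bits` are hypothetical helper names.

```python
N = 10
MOD_POLY = (1 << 10) | (1 << 3) | 1  # x^10 + x^3 + 1, assumed for illustration
ALPHA = 0b10                          # the element x, a root of MOD_POLY


def gf_mul(a, b):
    """Multiply two elements of GF(2^10) in polynomial-basis representation."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= MOD_POLY
    return result


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


def trace(x):
    """Tr(x) = x + x^2 + x^4 + ... + x^(2^9); the value lies in GF(2)."""
    total, y = 0, x
    for _ in range(N):
        total ^= y
        y = gf_mul(y, y)
    return total


def state_bits(beta):
    """Initial state bits w_i = Tr(alpha^i * beta) for i = 0..9."""
    return [trace(gf_mul(gf_pow(ALPHA, i), beta)) for i in range(N)]
```

Because the map from beta to its ten trace bits is linear and one-to-one, knowing the bits is equivalent to knowing beta, which is why the attack can work with linear equations in the coordinates of beta.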



TABLE I

Comparison of Key Recovery Attacks on WG-7

Complexity             DFT Attack [10]   AA [11]      FAA [11]     Our Attack
Offline Computation    O(2^39.5)         -            O(2^26.87)   O(2^41)
Keybits Requirements   2^25.5            2^19.38      2^19.3       2^18
Online Computation     O(2^25.5)         O(2^54.36)   O(2^26.73)   O(2^21)



V. Conclusion 

This paper has investigated the security of the WG-7 stream
cipher by proposing the use of an annihilator in the Fast
Selective DFT Attack. New results for the Discrete Fourier
Transform attack against WG-7 have been presented, which improve
on the previous attack. As a result, the complexity of the DFT
attack on the cipher is reduced considerably. The key recovery
attack employing the Fast Selective DFT attack recovers the key
with a keystream requirement of about 2^18 bits and an online
computation of about O(2^21). The presented results reveal that
the WG-7 stream cipher is not secure against the DFT attack. The
same attack can be extended to the WG-8 cipher [12] after finding
an annihilator of its non-linear filter function.
Abstract: Modern high-performance routers rely on customized
hardware solutions and are therefore difficult to adapt to
continuously changing network protocols. Software routers provide
flexibility and programmability, but achieve lower throughput.
Modern GPUs offer significant computing power, and their
data-parallel computing model matches packet processing on
routers. As routing table sizes increase along with physical link
speeds, IP address lookup has become a difficult problem at
routers, and achieving cost-effective, high-performance IP lookup
remains a challenge. With increasing Internet traffic and
fast-changing network protocols, next-generation routers have to
satisfy the need for throughput, scalability, flexibility and QoS.
The aim of this paper is to compare different GPU-based networking
algorithms on different performance metrics and to discuss their
benefits and drawbacks.

Keywords: Networking algorithms; Software Router; GPU;
IP Lookup; QoS.

I. Introduction 

The Internet and wireless technologies have grown rapidly and
have been immensely successful in recent years, but with the
increase in network size and number of users, network congestion
is a recurring phenomenon that causes longer delays, packet loss
and other performance degradation. Computing is entering an era
in which applications demand a great deal of processing. Such
applications include multimedia, high-frequency sensing and file
transfer. Since the devices running these applications need to be
an integral part of the network, solutions for reducing the
effects of congestion in wireless networks are required.
Accordingly, Internet routers have to deliver enhanced processing
capacity. Modern routers need to run data-intensive applications
such as intrusion detection along with traditional router
applications such as packet classification and route table
lookup, which puts pressure on router throughput. To keep pace
with growing Internet traffic, high-performance routers depend on
custom hardware. However, proprietary router hardware does not
meet the requirement for programmability, and custom hardware
solutions make it hard for routers to adapt to changing network
protocols. On modern routers, packet processing involves a series
of operations on packet headers by a CPU or a network processor.
A router has to deliver high throughput under rigorous
quality-of-service requirements, and existing protocols have to
be updated to accommodate changing network applications.



Graphics processing units (GPUs) are emerging as a new trend in
high-performance computing. Multi-core CPUs generally exploit
parallelism at the task level, whereas GPU computing is an
example of data-parallel programming. A GPU launches many threads
at the same time for a particular job, each thread running the
same program on a different set of data.

It is advisable to use GPUs as dedicated packet processors in a
software router, which has several advantages. The superior
computing power of the GPU can be utilized for high
packet-processing throughput. The software router can also be
assembled from off-the-shelf hardware, making it cost efficient.
Another advantage is that GPUs are supported by mature software
development tools, since these tools address a mass market. The
remainder of the paper is organized as follows: Section 2
describes the GPU-based networking algorithms. Section 3 presents
the comparison and analysis. Section 4 concludes the paper.

II. GPU based Networking algorithms 
A. IP Routing Processing with Graphic Processors 




Figure 1. GPU-based software router

Mu [1] proposed GPU solutions for a series of core IP routing
applications, such as IP routing table lookup and pattern
matching. Bloom filter and finite-automata-based matching
algorithms were used for deep packet inspection, and a GPU-based
solution for routing table lookup was also proposed. Fig. 1 and
Fig. 2 show the GPU-based software router and the work flow of
matching, respectively.

• Network Intrusion Detection

In current network devices, network intrusion detection
techniques are used to meet the demand for services such as
traffic shaping, quality of service, and intrusion detection.
The challenge is signature matching at line rate, that is,
confirming whether network payloads contain any of a set of
predefined signatures. In signature matching, a packet is
searched for a predefined rule set of strings in the context
of deep packet inspection (DPI). Mu [1] developed efficient
GPU solutions for string and regular expression matching. Two
algorithms are used in this work:

1) Bloom filter

A Bloom filter is a simplified relative of the hash table. It
is a simple, space-efficient data structure that represents a
set for fast membership queries, designed to determine rapidly
and memory-efficiently whether an element is present in the
set. The Bloom filter algorithm is well suited to a GPU
implementation, since the evaluations on different strings are
completely independent. GPU-based Bloom filter processing
achieves a throughput of about 19 Gbit/s, roughly 30 times the
CPU throughput (0.6 Gbit/s).
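
To make the data structure concrete, the following is a minimal CPU-side sketch of a Bloom filter. It is not the GPU implementation of [1]: the class name `BloomFilter`, the bit-array size, the number of hash functions and the use of seeded SHA-256 as the hash family are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Bit vector plus k hash functions; membership tests may give false
    positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    signatures = BloomFilter()
    signatures.add(b"/etc/passwd")
    print(b"/etc/passwd" in signatures)
```

Each query touches only the bit vector and is independent of every other query, which is the property that makes the per-string evaluation easy to spread across GPU threads.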

2) Aho-Corasick (AC)

The AC algorithm uses a Deterministic Finite Automaton (DFA)
to identify regular expression patterns. A DFA can be
represented as a graph in which each node represents a unique
state and each edge denotes a possible transition between
states. An efficient data structure for the DFA is a 2-D
transition table, in which each row corresponds to a state and
each column corresponds to a distinct letter of the accepted
alphabet. For the AC implementation, a throughput of 3.2
Gbit/s is achieved by the GPU kernel with page-locked memory,
which is almost 5 times faster than the CPU (0.6 Gbit/s).
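
The 2-D transition table can be sketched as follows for plain string patterns. This is a textbook Aho-Corasick construction written for illustration, not the code of [1]; `build_dfa` and `search` are hypothetical names, and a byte alphabet of size 256 is assumed.

```python
from collections import deque


def build_dfa(patterns, alphabet_size=256):
    """Build table[state][byte] -> next state, plus the patterns recognised
    on entering each state."""
    table = [[0] * alphabet_size]
    accepting = [[]]
    # Phase 1: insert every pattern into a trie (state 0 is the root).
    for pat in patterns:
        state = 0
        for byte in pat:
            nxt = table[state][byte]
            if nxt == 0:
                table.append([0] * alphabet_size)
                accepting.append([])
                nxt = len(table) - 1
                table[state][byte] = nxt
            state = nxt
        accepting[state].append(pat)
    # Phase 2: breadth-first pass that folds failure links into the table,
    # so that matching needs exactly one table lookup per input byte.
    fail = [0] * len(table)
    queue = deque(s for s in table[0] if s)
    while queue:
        state = queue.popleft()
        accepting[state] = accepting[state] + accepting[fail[state]]
        for byte in range(alphabet_size):
            nxt = table[state][byte]
            if nxt:
                fail[nxt] = table[fail[state]][byte]
                queue.append(nxt)
            else:
                table[state][byte] = table[fail[state]][byte]
    return table, accepting


def search(table, accepting, payload):
    """Return (start_offset, pattern) for every match in payload."""
    state = 0
    hits = []
    for pos, byte in enumerate(payload):
        state = table[state][byte]
        for pat in accepting[state]:
            hits.append((pos - len(pat) + 1, pat))
    return hits
```

In a data-parallel setting, one thread would run the `search` loop over one packet payload while all threads share the same read-only table.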



Figure 2. Working Flow in Processing Matching (packets and the
Bloom vector or transition table are held in memory; the matching
process runs on the GPU)



• Routing Table Lookup for Packet Forwarding

The Longest Prefix Match algorithm has been used and extended
for GPU implementation. The routing table is organized as a
radix tree in which each node corresponds to a state in the
search process and each edge connecting nodes represents a bit
value in the destination IP address. Traversing a path through
the tree constitutes the matching process. The route table is
maintained by the CPU. Whenever the route table is created or
changed, it is copied to GPU memory. Parallelization is in
that case straightforward: one thread is used to process the
IP address in one packet. In the radix tree, nodes and edges
are handled through pointers, and this kind of pointer chasing
is difficult on GPUs. A major problem is that the pointers
within the tree structure cannot be moved directly to GPU
memory. The implementation therefore uses a modified data
structure called the "portable routing table" (PRT), which
uses displacements in place of pointers for tree operations.
The routing table is kept in texture memory to make use of the
GPU's on-chip cache. The proposed GPU implementation achieved
a 6-times speedup. Overall, the work developed efficient GPU
implementations for router applications that include IP
routing table lookup and pattern matching for network
intrusion detection. The results showed that the GPU can
accelerate packet processing by one order of magnitude.
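
The idea of replacing pointers with displacements can be sketched with a binary trie stored in flat arrays, where children are referenced by array index. This is an illustration under stated assumptions, not the PRT layout of [1]: the class name, the three parallel arrays, IPv4-only addresses and one bit per trie level are all choices made for the example.

```python
import ipaddress

NO_CHILD = 0    # index 0 is the root, so it can never be a child
NO_ROUTE = -1


class PortableRoutingTable:
    """Binary trie in flat arrays; the arrays contain no pointers and could
    be copied to another address space unchanged."""

    def __init__(self):
        self.left = [NO_CHILD]
        self.right = [NO_CHILD]
        self.next_hop = [NO_ROUTE]

    def _new_node(self):
        self.left.append(NO_CHILD)
        self.right.append(NO_CHILD)
        self.next_hop.append(NO_ROUTE)
        return len(self.left) - 1

    def insert(self, prefix, hop):
        net = ipaddress.ip_network(prefix)
        addr = int(net.network_address)
        node = 0
        for i in range(net.prefixlen):
            bit = (addr >> (31 - i)) & 1
            children = self.right if bit else self.left
            if children[node] == NO_CHILD:
                children[node] = self._new_node()
            node = children[node]
        self.next_hop[node] = hop

    def lookup(self, address):
        """Longest prefix match: remember the last route seen on the path."""
        addr = int(ipaddress.ip_address(address))
        node, best = 0, self.next_hop[0]
        for i in range(32):
            bit = (addr >> (31 - i)) & 1
            node = (self.right if bit else self.left)[node]
            if node == NO_CHILD:
                break
            if self.next_hop[node] != NO_ROUTE:
                best = self.next_hop[node]
        return best
```

Since `lookup` only reads the arrays, many lookups can run concurrently, one per packet, which matches the one-thread-per-packet organisation described above.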



B. PacketShader: A GPU-Accelerated Software Router

PacketShader uses the massively parallel processing power of the
GPU to resolve the CPU bottleneck in current software routers.
Together with its high-performance packet I/O engine,
PacketShader outperforms existing software routers by a factor of
four. Han [2] implemented IPv4 and IPv6 forwarding, OpenFlow
switching, and IPsec tunneling to demonstrate the performance
improvement of PacketShader.
Figure 3. PacketShader software architecture

Figure 3 shows the architecture of PacketShader. First, the
authors implement a highly optimized packet I/O engine that
eliminates per-packet memory management overhead and uses batch
processing to enable high-performance packet I/O in user mode.
They use compact metadata and a huge packet buffer, which reduce
the initialization cost of the metadata structure and eliminate
the cost of per-packet buffer allocation, respectively.

Per-packet processing overhead is reduced significantly by batch
processing of multiple packets, and batch processing is extended
to the application (packet processing) level. To maximize the
benefit of the huge packet buffers, aggressive software
prefetching is performed; prefetching consecutive packets
eliminates the compulsory cache miss latency. The multi-core and
NUMA scalability of packet I/O is also described.

Second, they offload core operations (e.g., IP table lookup or
IPsec encryption) to GPUs and scale packet processing with
massive parallelism. Combined with I/O path optimization,
PacketShader maximizes GPU utilization by exposing as much
parallelism as possible within a small time frame. PacketShader
is a multithreaded program running in user mode. For packet I/O,
PacketShader invokes the packet API, which consists of wrapper
functions around the kernel-level packet I/O engine. A packet
processing application runs on top of the framework and is driven
by three functions (pre-shader, shader, post-shader).

Pre-shading: Each worker thread fetches a chunk of packets from
its RX queues. It drops any malformed packets and classifies
normal packets that will be processed by the GPU. It then passes
the input data to the input queue of a master thread.

Shading: The master thread moves the input data from host memory
to GPU memory, executes the GPU kernel, and returns the results
from GPU memory to host memory. It then puts the results back
into the output queue of the worker thread for post-shading.

Post-shading: A worker thread picks up the results from its
output queue and modifies, drops, or duplicates packets in the
chunk depending on the processing results. Finally, it splits the
packets in the chunk by destination port for packet transmission.
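
The three-stage flow can be pictured with a single-threaded sketch. It is only a schematic under stated assumptions: packets are modelled as dictionaries with hypothetical `valid` and `dst` fields, the "GPU kernel" is replaced by an ordinary Python function that processes the whole batch, and the queues between worker and master threads are omitted.

```python
def pre_shader(rx_chunk):
    """Drop malformed packets and gather the data the shader needs."""
    good = [pkt for pkt in rx_chunk if pkt.get("valid", False)]
    inputs = [pkt["dst"] for pkt in good]
    return good, inputs


def shader(inputs, forwarding_table):
    """Stand-in for the GPU kernel: the same lookup is applied independently
    to every element of the batch."""
    return [forwarding_table.get(dst) for dst in inputs]


def post_shader(packets, results):
    """Drop packets without a route and split the rest by output port."""
    by_port = {}
    for pkt, port in zip(packets, results):
        if port is None:
            continue
        by_port.setdefault(port, []).append(pkt)
    return by_port


if __name__ == "__main__":
    chunk = [
        {"dst": "10.0.0.1", "valid": True},
        {"dst": "10.0.0.2", "valid": False},
        {"dst": "10.9.9.9", "valid": True},
    ]
    good, inputs = pre_shader(chunk)
    print(post_shader(good, shader(inputs, {"10.0.0.1": 1})))
```

Only the compact input list crosses into the shader stage, which mirrors the point that the data moved to GPU memory is a gathered subset of each packet rather than the packet itself.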

PacketShader exploits the following optimization opportunities in the workflow:

• Chunk Pipelining 

• Gather/Scatter 

• Concurrent Copy and Execution 

Four applications are implemented on top of PacketShader: IPv4
and IPv6 forwarding, an OpenFlow switch and IPsec tunneling. Each
application is run in two modes, CPU-only and CPU+GPU, to
evaluate the effectiveness of GPU acceleration. PacketShader
forwards IPv4 packets at a rate of 40 Gbps for all packet sizes.
GPUs brought major performance improvements for memory-intensive
and compute-intensive applications.

Per-packet processing overhead in the network stack is minimized,
and packet processing is carried out without a severe performance
penalty. PacketShader scales well with multi-core CPUs,
high-speed NICs, and GPUs in NUMA systems. The effectiveness of
the approach is demonstrated with IPv4 and IPv6 forwarding,
OpenFlow switching, and IPsec tunneling. The evaluation results
show that the GPU gives significantly higher throughput than the
CPU, confirming the usefulness of the GPU for
computation-intensive and memory-intensive operations in packet
processing. This work demonstrates that a PC-based router can
achieve 40 Gbps forwarding performance with full programmability
on commodity hardware.



C. Achieving O(1) IP Lookup on GPU-based Software Routers

IP address lookup poses a challenge because of increasing routing
table sizes and higher line rates. Zhao [3] proposed novel ways
to build an efficient IP lookup scheme using the GPU. Their
contribution lies in designing a basic architecture for a
high-performance IP lookup engine on the GPU, and in developing
efficient algorithms for routing prefix operations such as
lookup, deletion, insertion, and modification. The IP lookup
scheme attains O(1) time complexity. Their GPU-accelerated IP
lookup architecture, GR, is shown in Fig. 4.

IP lookup: They proposed a square hash mapping policy for routing
lookup. The hash function takes a 24-bit number as input and
produces a 2D coordinate as output. Under this mapping every
prefix entry occupies a square area in the texture, which can be
manipulated efficiently by the graphics processing facilities.

Modification: To modify the next-hop information of a given
entry, a lookup operation is performed first. After locating the
memory address, GR writes the new next-hop information to the
target memory.

Insertion: Inserting a prefix entry takes three steps: (1) set
the GL depth function to GL_GEQUAL, (2) generate the prefix's
square, (3) draw the square with depth equal to the prefix's
slash value (its prefix length).

Deletion: The value of the deleted entry is replaced with its
parent's value in two passes. In the first pass, the stencil
buffer of the deleted square is set to zero. In the second pass,
the stencil buffer of the deleted square is set equal to its
parent's depth value.

In this paper, they provided an alternative approach for building
cost-effective, high-performance IP lookup engines and the
corresponding routing table update schemes using GPUs. Along with
parallelism, they used an IP lookup method that provides O(1)
time complexity for each lookup. The key improvements include:
(1) IP addresses are translated into memory addresses using a
large routing table in GPU memory. (2) Route update operations
are performed using the GPU's graphics processing facilities,
such as the Z-buffer and stencil buffer.

They presented the design and evaluation of GR, which exploits
massive parallelism to speed up routing table lookup, and showed
that there is potential for significant improvement, e.g., 6
times faster than a trie-based implementation. They also designed
efficient algorithms for the deletion, insertion and modification
operations.

D. Exploiting Graphics Processors for High-performance IP
Lookup in Software Routers

IP lookup is a demanding problem at routers because of increases
in the physical link rate and the routing table size, and the
demand for cost-effective, high-performance IP lookup has been
growing. Conventional approaches normally use specialized
hardware, such as TCAM. While these approaches benefit from
hardware parallelism to achieve high-performance lookup, they
incur a high cost. The contribution of Zhao et al. [4] lies in
designing an architecture for a high-performance lookup engine on
the GPU, and in developing efficient algorithms for routing
prefix update operations. Leveraging the GPU's parallelism, the
proposed schemes address the challenges of designing IP lookup
for GPU-based software routers.
Figure 5. GPU-based IP lookup engine

The proposed GPU-accelerated IP lookup engine, GALE, is shown in
Fig. 5.

Zhao [4] proposed a new method for building cost-effective,
high-speed IP lookup engines and the corresponding routing table
update methods using GPUs. Building an efficient IP lookup engine
on GPUs, however, is a non-trivial task because of the demanding
data-parallel programming model of the GPU and the high
performance requirements. GALE, a GPU-Accelerated Lookup Engine,
enables high-performance IP lookup inside software routers. It
uses CUDA [23] to perform IP lookups on the GPU in parallel,
together with an IP lookup method that achieves O(1) time
complexity per lookup. The key ideas include: (1) IP addresses
are converted into memory addresses using a large direct table in
GPU memory, which stores all possible routing prefixes. Only one
direct memory access is required to find the next-hop
information. The scheme can therefore be scaled to extremely high
line rates with small computing overhead and memory access
latency. (2) Although the large routing table is efficient for
lookup, it makes routing updates, which happen regularly in core
routers, much more time-consuming and complex: multiple table
entries need to be updated when only a single original prefix is
added or deleted. This is addressed by performing the routing
update operations on the GPU.

Lookup:

The proposed scheme stores the next-hop value of every possible
IP prefix in the direct table, so that each IP address maps to
exactly one entry in the direct table. The direct table's entries
are derived from the trie structure. In a lookup for an incoming
IP address, for example a.b.c.d, the most significant 24 bits,
a.b.c, are used as the index of the corresponding next-hop
information in the direct table.

Routing table update:

a) Insertion and Modification:

Insertion and modification are essentially the same operation on
the direct table. For a new route prefix, GALE writes the new
next-hop value to the entries of the affected IP prefixes. The
range of entries to update depends on the length of the new
prefix. However, an entry's next-hop value is updated only on the
condition that the new prefix is longer than the prefix already
stored for that entry.

b) Deletion:

The delete operation is similar to the modify operation, except
that while the range of entries is updated, the next-hop
information is set to the parent node's next-hop information in
the trie. Deleting an entry is therefore done as follows: the
next hop and prefix length of each updated entry are replaced
with the parent's next hop and prefix length. The parent node is
obtained by the trie traversal that the CPU performs while
removing the entry's node from the trie.
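
The lookup and update rules above can be sketched with a scaled-down direct table. This is an illustration, not the GALE implementation: the table is indexed by 8 bits instead of the 24 bits described in the text so that it stays small, the class and method names are hypothetical, and the insertion test uses "greater than or equal" so that modifying an existing prefix of the same length also takes effect. The parent's next hop and prefix length are passed in by the caller, standing in for the CPU-side trie traversal.

```python
class DirectTable:
    """One entry per possible index value; lookup is a single array access."""

    def __init__(self, index_bits=8):
        self.index_bits = index_bits
        size = 1 << index_bits
        self.next_hop = [None] * size
        self.prefix_len = [-1] * size  # -1 means no route covers the entry

    def _covered(self, prefix, length):
        """Range of table indices covered by prefix/length."""
        free_bits = self.index_bits - length
        start = (prefix >> free_bits) << free_bits
        return range(start, start + (1 << free_bits))

    def insert(self, prefix, length, hop):
        # Overwrite an entry only if the new prefix is at least as specific
        # as the one that currently owns it.
        for i in self._covered(prefix, length):
            if length >= self.prefix_len[i]:
                self.next_hop[i] = hop
                self.prefix_len[i] = length

    def delete(self, prefix, length, parent_hop, parent_len):
        # Entries owned by the deleted prefix fall back to the parent route;
        # entries owned by more specific prefixes are left untouched.
        for i in self._covered(prefix, length):
            if self.prefix_len[i] == length:
                self.next_hop[i] = parent_hop
                self.prefix_len[i] = parent_len

    def lookup(self, index):
        return self.next_hop[index]
```

The trade-off is visible in the loops: lookup is constant time, while a short prefix forces an update over a large range of entries, which is the cost the scheme moves onto the GPU.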

Designing efficient IP lookup algorithms for software routers
that meet the objectives of cost-effectiveness, high-performance
lookup and low-cost route updates is a challenge. To address it,
the following key contributions are made in this paper: (1) The
design and evaluation of a GPU-accelerated IP lookup engine,
GALE, which exploits massive parallelism to speed up routing
table lookups. (2) A direct table for efficient IP lookup with
small computing overhead and memory access latency; specifically,
it achieves O(1) computational complexity and only one memory
access per lookup request. (3) Efficient methods for route-update
operations such as deletion and insertion. Experimental results
demonstrated that there is potential for significant improvement
in lookup throughput with a proper design. With a better
performance/price ratio and flexible programmability, GPUs are
well placed to deliver high-performance routing processing.

E. Hermes: An Integrated CPU/GPU Microarchitecture for
IP Routing

With ever-growing Internet traffic and rapidly changing network
protocols, future routers have to meet the demand for throughput,
scalability, QoS, and flexibility. Zhu [5] proposed a new
integrated CPU/GPU microarchitecture, Hermes, for QoS-aware
high-speed routing. Zhu [5] also proposed a new thread scheduling
mechanism that improves QoS metrics significantly.

1) Hermes is an integrated CPU/GPU shared-memory
microarchitecture that includes an adaptive warp issuing
mechanism for IP packet processing. Zhu [5] developed this
GPU-based packet processing platform to optimize the QoS metrics.

2) This architecture is used to implement router-related
applications. The authors carried out extensive QoS performance
evaluations. Hermes delivers 5X higher throughput than the
GPU-based router of [1], as well as an 81.2% reduction in average
packet delay and a 72.9% reduction in delay variance.
A diagram of Hermes is shown in Figure 6. The basic execution
flow remains similar to that of a conventional CPU/GPU system.
Packet data are stored in the shared memory on arrival and then
retrieved by shader cores for processing. The processed packets
are written back to the shared memory, from where they can be
further processed by the CPU or forwarded.

Adaptive Warp Issuing Mechanism:

The warp issuing mechanism of Hermes handles the assignment of
parallel tasks to shader cores for further intra-core scheduling.
In current GPUs, all thread warps are kept in a warp pool before
being issued.

For packet processing applications, waiting for a sufficient
number of warps is unavoidable under certain circumstances. They
therefore proposed a mechanism that adapts to the arrival pattern
of network packets and maintains a balance between overall
throughput and worst-case per-packet delay.

In the fine-grained multithreading execution model, thread warps
running on the same or different shader cores may finish in an
arbitrary order. As a result, sequentially arrived packets may be
processed in any order. Protocols like UDP need in-order
processing, so a Delay Commit Queue (DCQ) is used.
Figure 7. Delay Commit Queue with a Lookup Table

The key idea is to allow out-of-order warp execution but enforce
in-order commitment. In Figure 7, the Delay Commit Queue holds
the IDs of warps that have finished but not yet committed. Each
time a warp is about to be issued to a core, a new DCQ entry is
allocated if the DCQ has space. The mapping between warp ID and
DCQ entry ID is stored in a lookup table (LUT). When a warp
finishes, the corresponding DCQ entry is updated by indexing the
LUT with the completed warp's ID. A warp can be committed only
when all warps that arrived before it have finished. When a warp
commits, its DCQ entry is reclaimed.

A complete set of router applications was implemented on the
Hermes architecture. Experimental results showed that this
architecture can meet strict delay requirements while maintaining
high processing throughput.
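
The commit rule can be sketched in software as a bounded queue with a lookup table. This is a behavioural model for illustration, not the Hermes hardware design: the class name `DelayCommitQueue`, the dictionary used as the LUT and the fixed capacity are assumptions.

```python
from collections import deque


class DelayCommitQueue:
    """Warps may finish in any order but are committed in issue order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # each entry is [warp_id, finished_flag]
        self.lut = {}           # warp_id -> its entry in the queue

    def issue(self, warp_id):
        """Allocate an entry at issue time; refuse if the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        entry = [warp_id, False]
        self.entries.append(entry)
        self.lut[warp_id] = entry
        return True

    def finish(self, warp_id):
        """Mark a warp finished and return the warps that can now commit."""
        self.lut[warp_id][1] = True
        committed = []
        # Commit from the head while the oldest outstanding warp is done.
        while self.entries and self.entries[0][1]:
            done_id, _ = self.entries.popleft()
            del self.lut[done_id]  # the entry is reclaimed on commit
            committed.append(done_id)
        return committed


if __name__ == "__main__":
    dcq = DelayCommitQueue(capacity=4)
    for warp in (0, 1, 2):
        dcq.issue(warp)
    print(dcq.finish(2), dcq.finish(0), dcq.finish(1))
```

A warp that finishes early simply waits in the queue, so packets leave in arrival order even though execution order is arbitrary.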

F. Scalable Packet Classification via GPU
Metaprogramming

Packet classification is a primary processing pattern of modern
networking devices. Contemporary routers use special hardware for
packet classification; however, such solutions are not cost
efficient and have high power consumption and poor extensibility.
Software-based routers offer the best flexibility but deliver
lower performance. GPUs have proved to be an effective
accelerator for software routers. Kang [6] proposed a GPU-based
linear search framework for packet classification. The crux of
this framework is a metaprogramming technique that significantly
improves execution efficiency.

GPU packet classification using DBS algorithm:

The Discrete Bit Selection (DBS) algorithm was implemented on
both GPU and CPU. Preprocessing was performed on the CPU in both
versions. The GPU threads are organized into 120 blocks of 128
threads each, and 14,336 packets are processed in a single batch
during each kernel launch.

In DBS, the partitioning of rule sets is limited to certain
granularities, which suggests a performance bound for more
general categories of algorithms than DBS or hash-based
algorithms alone. The limit is due to the overlapping of rules in
the classification space.

GPU packet classification using linear search:

1) Each rule is translated into a piece of C logic that checks
whether the packet matches. The rules thus become embedded as
compile-time constants.

2) The code fragments are assembled into a series of kernels,
which are compiled and uploaded to the graphics card as binaries.

3) Packet header data is loaded into GPU memory, the kernels are
executed, and the classification results are copied back to the
CPU.

This technique eliminates the fetching of rules from memory: the
rule set is compiled as C program code, which can be broadcast
efficiently to all CUDA cores on the GPU by the instruction
issuing mechanism. The memory latency of fetching the instruction
binary can be hidden by the GPU's instruction cache.
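
The three steps above can be imitated on the CPU to show what "rules as compile-time constants" means. The sketch below swaps the original technique's target for a simpler one: instead of generating C for CUDA kernels, it generates Python source and compiles it with the built-in `compile` and `exec`. The rule format (source prefix, prefix length, destination port range, action) and the name `generate_classifier` are assumptions made for illustration.

```python
def generate_classifier(rules):
    """Turn a rule list into source code with the rule values baked in,
    then compile that source into a callable.

    rules: list of (src_prefix, prefix_len, port_lo, port_hi, action),
    checked in order (linear search, first match wins).
    """
    lines = ["def classify(src_ip, dst_port):"]
    for src, length, lo, hi, action in rules:
        mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
        # Every rule becomes one literal comparison; no rule table is read
        # from memory when the generated function runs.
        lines.append(
            f"    if (src_ip & {mask:#x}) == {src & mask:#x} "
            f"and {lo} <= dst_port <= {hi}:"
        )
        lines.append(f"        return {action!r}")
    lines.append("    return None")
    source = "\n".join(lines)
    namespace = {}
    exec(compile(source, "<generated rules>", "exec"), namespace)
    return namespace["classify"], source


if __name__ == "__main__":
    classify, source = generate_classifier([
        (0x0A000000, 8, 80, 80, "web"),
        (0x00000000, 0, 0, 65535, "default"),
    ])
    print(source)
    print(classify(0x0A010203, 80))
```

Changing the rule set means regenerating and recompiling the function, which is the price paid for removing rule fetches from the matching path.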

The authors investigate effective data-parallel algorithms for
packet classification, following the direction of previous work
on GPU-based software routers [1] [2]. A main observation is that
many algorithms have low theoretical complexity but scale poorly
to large rule sets and do not meet the performance constraints.
The authors resolved the scalability issue by focusing on
efficient parallelization and avoiding complicated algorithms.
They proposed a GPU-based linear search scheme for packet
classification that adopts a novel metaprogramming approach.
Experimental results demonstrate that it significantly
outperforms existing software solutions, even for large rule
sets.

GPU linear search gives 17X higher throughput than the Discrete
Bit Selection algorithm on the CPU, and 8.5X higher throughput
than its own CPU counterpart. The prototype implementation
maintains good scalability even when the rule set size reaches
50,000. They also demonstrated that metaprogramming can be a
powerful tool for GPU computing.

G. An Analysis of Queuing Network Simulation Using GPU-Based Hardware Acceleration 



Queuing networks are applied broadly in computer simulation 
studies; they are used to model supply chains, manufacturing 
and routing. If the networks are small in size and complexity, 
discrete event simulations of the networks can be created 
without incurring major delays in analyzing the system. 
However, as network size grows, such analysis becomes time 
consuming and therefore needs expensive parallel processing 
computers or clusters. 
Park et al. [7] have developed tools that simulate queuing 
networks in parallel, using the fairly economical and generally 
available graphics processing units (GPUs) found in most 
recent computing platforms. They gave an analysis of a 
GPU-based algorithm, highlighting the pros and cons of the GPU 
approach. The algorithm clusters events, attaining speedup at 
the expense of an approximation error that grows as the cluster 
size increases. They were able to achieve a 10x speedup using 
this approach with only a small error in a particular 
implementation of a closed queuing network simulation. The 
error can be reduced, based on the tendency observed in the error 
analysis, thus achieving reasonably correct output statistics. The 
experimental results of the MANET simulation show that 
errors occur only in the time-dependent output statistics. 
The authors examined three different queuing models to study 
the effects of their simulation method: the general network 
topologies of open and closed queuing networks, and 
computer network models. They analyzed the trade-offs 
between numerical errors and performance gain, as well as the 
methods for error estimation and correction. They used 
parallel execution on the GPU to reduce the overall execution 
time of the simulation. 
Time-Synchronous/Event Algorithm: 

This algorithm is a mixture of discrete event simulation and 
time-stepped simulation. In this approach, events are executed 
at the end of the time interval to increase the extent of 
parallelism. Because events are executed only at the end of the 
time interval, each event is delayed with respect to its original 
timestamp and the results lose accuracy. The authors rely on 
queuing theory for estimating the total error rate of the 
simulation results. A non-decreasing timestamp order of event 
execution is preserved by the synchronous step of the 
simulation, which blocks event extraction from the future event 
list (FEL) until the current events have finished scheduling the 
next events. 
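
The clustering of events at interval boundaries can be sketched as follows. This is a simplified, sequential Python illustration under stated assumptions, not the implementation of [7]: events are plain `(timestamp, name)` pairs, the interval width `dt` is a free parameter, executed events do not schedule new ones, and the function name `run_time_synchronous` is hypothetical. It reports the accumulated delay between each event's original timestamp and the time at which it is actually executed, which is the source of the accuracy loss described above.

```python
import heapq
import math
from typing import List, Tuple


def run_time_synchronous(events: List[Tuple[float, str]], dt: float):
    """Execute events clustered at the end of each time interval of width dt.

    Returns (executed, total_delay): executed is a list of
    (execution_time, original_timestamp, name) tuples; total_delay is the
    summed difference between execution time and original timestamp.
    """
    fel = list(events)          # future event list (FEL)
    heapq.heapify(fel)          # keeps non-decreasing timestamp order
    executed = []
    total_delay = 0.0
    while fel:
        # End of the interval that contains the earliest pending event.
        boundary = math.ceil(fel[0][0] / dt) * dt
        # Extract every event of this interval; in a GPU version these
        # would be processed in parallel, one thread per event.
        batch = []
        while fel and fel[0][0] <= boundary:
            batch.append(heapq.heappop(fel))
        for timestamp, name in batch:
            executed.append((boundary, timestamp, name))
            total_delay += boundary - timestamp
    return executed, total_delay
```

A larger `dt` puts more events into each batch, which raises the available parallelism, and it also raises the total delay, which mirrors the speed versus accuracy trade-off discussed in this section.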

Closed and open queuing network: 




Figure 8. Closed queuing network 




Figure 9. Open queuing network 



Fig. 8 and Fig. 9 show closed and open queuing networks 
respectively. In an open queuing network the number of 
tokens in the network varies over time due to arrivals and 
departures, whereas in a closed queuing network the number 
of tokens is constant during the simulation. It is possible to 
use a fixed-size array for the FEL in the closed queuing 
network because the number of tokens is fixed during the 
simulation. In an open queuing network the required size of the 
FEL varies with the number of tokens at any instant of time, so 
the size of the FEL is set to double the number of tokens. 

Computer Network Model: 

Queuing models were developed to analyze the performance of 
computer network systems. Fig. 10 shows the queuing delay in 
the computer network model. There are four types of delay when 
a packet is sent from one node to an adjoining node: 
processing, queuing, transmission and propagation delays. 
Queuing delay is the most studied of these delays, as it is 
the only delay caused by the traffic load and congestion 
pattern. 

Error Analysis: 

The authors found the error to be frequent enough to be 
approximated, so they developed error estimation and 
correction methods to provide more precise statistics. The total 
error rate equation is derived from queuing theory 
equations. These equations, coupled with the results of the 
simulation, provide more accurate results without developing a 
complicated analytical model for each node. They analyzed 
the impact of their method on different queuing models. The 
problems of event distribution and the poor fit of GPUs 
to discrete event simulation are dealt with (1) by allowing events 
to occur at approximate boundaries at the cost of accuracy and 
(2) by using a detection and compensation method for error 
reduction. The experimental results exhibit a 10x speedup 
using their algorithm at the cost of accuracy in the results. The 
statistical results in MANET simulations show that their 
method only introduces an error in the time-dependent statistics. 
Although the improvement in performance introduced error into 
the results of the simulation, the experimental results showed that 
the error in the queuing network is frequent enough to be 
estimated, and that it can be reduced using queuing theory to 
give more accurate results. 

III. Comparison and Analysis 

From the GPU-based networking algorithms described above, it can 
be seen that they differ in their applications as well as in 
throughput, delay, QoS, etc. As illustrated in Table 1, different 
networking algorithms are executed in parallel, and the results 
achieved by these algorithms are better than those of the 
existing ones. 

IV. Conclusion and Future Work 

Among the many network devices, Internet routers play an 
important role by serving as the backbone. A router delivers 
packets between connected networks in a timely manner. 
Conventional routers rely on custom hardware to achieve 
high processing speed. Besides being costly, such solutions are 
not adaptable to fast-changing network services and diversified 
use cases. On the other hand, software routers have been 
attracting the attention of researchers due to their extensibility 
and customizability. Software routers offer flexibility and 
cost-efficiency as they are built using commodity hardware. 
However, despite their better flexibility, software routers 
cannot meet the performance level required for growing 
Internet traffic, and their application is limited because of this 
low performance. There has been a move in 
high-performance computing to carry out general-purpose 
computing with graphics processing units (GPUs) due to their 
cost efficiency and availability. For compute-intensive 
applications, the large array of GPU cores offers 
higher computation power. In this paper we compared 
different GPU-based networking algorithms on different 
parameters. We have found that GPUs provide a substantial 
performance boost for networking algorithms, which will be 
helpful for extending the original networking protocols to 
next-generation high-speed systems. 

TABLE 1: Comparison of GPU based networking algorithms 



Protocols | Algorithms used | Performance metrics | Speed up 
IP Routing Processing with Graphic Processors [1] | (Bloom filter, Aho-Corasick), Routing Table Lookup | Throughput, Lookup time | 30, 5, 6 
PacketShader [2] | IPv4 Forwarding | Throughput | 4 
Achieving O(1) IP Lookup on GPU-based Software Routers [3] | IP Lookup | Lookup time | 6 
GALE [4] | IP Lookup | Lookup per second | 5-6 
Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing [5] | IP Routing Processing, QoS evaluation | Throughput, delay, delay variance | 5, 81%, 73% 
Scalable Packet Classification via GPU Metaprogramming [6] | Linear search | Throughput with scalability | 8.5 
An Analysis of Queuing Network Simulation Using GPU-Based Hardware Acceleration [7] | Queuing Network Simulation | Event execution time | 10 

Abstract: Textile industry is one of the largest and oldest sectors in 
India and has a formidable presence in the national economy in 
terms of output, investment and employment. Due to the increasing 
demand for quality fabrics, it is important to produce defect-free, 
high-quality fabric. Visual inspection consumes a lot of 
time and is error prone. The price of the fabric is reduced to 
45%-65% due to the presence of various defects. The purpose of 
this paper is to automate the detection and classification of 
texture defects using computer software. The proposed method 
uses a statistical approach for the inspection and detection of 
defects on woven/knitted fabric collected from the textile 
industry. The images are acquired, pre-processed, restored and 
normalized, and statistical features are extracted using computer 
vision. The extracted features are given as input to the artificial 
neural network decision tree classifier, which computes the 
weighted factor for detecting and classifying the type of defect. 
An automatic defect detection system can increase the texture 
defect detection percentage, reduce fabrication and labour costs, 
and improve the quality of the product. 

Keywords: Defect detection, Statistical approach, Computer vision, 
Decision tree classifier, neural network. 



I. INTRODUCTION 



The Indian textile industry, or apparel industry, is one of the 
most important industries for the Indian economy. It is primarily 
concerned with the production of yarn and cloth. Its 
importance is underlined by the fact that it accounts for 
around 4% of GDP, 14% of industrial production and 
17% of the country's total export earnings. All textile 
industries aim to produce competitive fabrics, and 
competitiveness is based on the productivity and quality of the 
fabrics produced by each industry. In the textile sector, there 
have been large losses due to defective fabrics. A fabric defect is 
any abnormality in the fabric that hinders its acceptability by 
the consumer. The types of faults present on knitted and 
woven fabrics include hole, scratch, stretch, dirty spot, fly 
yarn, cracked point, slub (thickness of yarn), colour 
bleeding (poor wet fastness), dye spot, broken lines, knots, pin 
marks, missing yarn, thin yarn, thick yarn, and bad selvage. 
If these faults are not detected properly, they can 
affect the production process massively. 

The fabric texture is usually made up of a repeating arrangement 
of warp and weft. The longitudinal threads are called the warp 
and the lateral threads are the weft or filling. The manner in 
which these threads are interwoven affects the characteristics of 
the cloth and, if they are not woven properly, produces defects in 
the fabric. A wide variety of defects occur; many are a 
direct result of machine malfunction while others arise from 
faulty yarns [9]. The various types of defects detected during 
quality control are broadly classified as follows [16] [17]: 

• Critical Defects - Defects which pose a hazard to the 
health of individuals using the product. 

• Major Defects - More serious defects which are likely to 
affect the purchase of the product. 

• Minor Defects - Small faults which have no 
effect on the purchase of the product. 



Some of the commonly occurring fabric defects are 
listed in Table I, taken from [4]. 



TABLE I. VARIOUS TYPES OF FABRIC DEFECTS 

Yarn Defects | Woven Defects | Knitted Defects | Dyeing & Finished Defects 
The defects originating from the spinning | The defects which originate during the process of weaving | The defects which occur during knitting of cloth | The defects which occur during the dyeing of textile products 
Broken Filaments | Broken Ends | Drop Stitches | Shade Variation 
Knots | Float | Yarn Streaks | Crease Mark 
Slub | Gout | Barriness | Pin Hole Damage 
Fabric press off | Hole, Cut or Tear | Fabric press off | Dye Spots 
Broken ends | Oil Stain | Broken Ends | Wrong Slitting 
Thick places | Slub | Spirality | Band Line 
Thin places | Missing end | Slub | Dust 
 | Missing Picks | Pin Hole | 
 | Reed Mark | Broken Needle | 
 | Colour Bleeding | Cracks or Holes | 

Manual defect detection in a fabric quality control system is 
a difficult task for inspectors. It has been 
observed [2] that the price of textile fabric is reduced by 50% to 
70% due to defects. The work of an inspector is very tedious 
and time consuming: inspectors have to detect small details that 
can be located anywhere in a wide area moving through their 
visual field. The identification rate is only about 70% [3]. 
Early and accurate detection of defects in fabrics is therefore an 
important aspect of product and quality improvement. 
Human visual inspection and automated inspection are 
compared in Table II, taken from [4]. Computer vision technology 
with artificial neural networks has been applied to textured 
samples over the past few years for developing automated 
defect detection and classification systems. Automatic fabric 
defect detection systems mainly face two challenging 
problems: 

1. Defect Detection 

2. Defect Classification. 

Feature selection and image enhancement play an important 
role in developing an automated defect classification system. 
Image processing techniques deal with the acquisition, 
segmentation, manipulation, and analysis of images. The 
advantages of digital imaging are accurate data acquisition, 
a better combination of spatial and contrast resolution, 
compact storage and easy retrieval, and fast, accurate image 
transmission. For an appropriate feature set, the 
distinguishing quality of the features should be high, the 
number of features should be small, and the selection should 
take into account the difficulties that lie in the feature 
extraction process [1]. 
Neural networks (NNs) and decision tree classifiers [35] are 
suitable for developing real-time systems because of 
their parallel-processing capability. Moreover, NNs have a 
strong capability and good accuracy in handling multiclass 
classification problems. Classification accuracy, model 
complexity and training time are three of the most important 
performance metrics of NN models. 

TABLE II. COMPARISON OF HUMAN AND AUTOMATED DEFECT INSPECTION 

Inspection Type              Visual Inspection   Automated Inspection 

Fabric Types                 100%                70% 
Defect Detection             70%                 80%+ 
Reproducibility              50%                 90%+ 
Objective Defect Judgement   50%                 100% 
Statistics Ability           0%                  95%+ 
Inspection Speed             30m/min             120m/min 
Response Type                50%                 80% 
Information Content          50%                 90%+ 
Information Exchange         20%                 90%+ 



The remainder of this paper is organized as follows. 
Section 2 describes relevant previous work in the field, 
such as textile fabric inspection systems, computer vision, 
and machine learning systems like neural networks for 
automated textile defect recognition. Section 3 provides 
the research objective and the approach used to automate the 
proposed textile defect detector. Section 4 describes the 
proposed system in detail, including the method for defect 
extraction, the scheme of feature extraction, and the structure 
of the neural network decision tree classifier. Section 5 gives 
the discussion and observations. Section 6 concludes with 
some remarks and plausible future research lines. 

II. LITERATURE REVIEW 

Fabric defect detection using digital inspection images has 
received considerable attention during the past decade, and 
numerous approaches have been proposed in the literature 
[1-45]. At the microscopic level, the inspection problems 
encountered in digital images become texture analysis 
problems [5]. Revathy-Vijaylakshmi [6] obtained the 
principal components from the co-occurrence matrix of 
fabric texture and then performed fuzzy rule based 
classification. It has been found that 90% of the defects in a 
plain fabric could be detected using thresholding techniques; 
the authors used the resilient backpropagation algorithm to train 
their NN [7] [8], and their networks were capable of dealing with 
multiclass problems. Sebeanni [9] and Ajay [5] introduced 
gray-level statistical methods for detecting defects in textile 
fabrics. Mac-Yiu [10] and [11] presented morphological filter 
techniques to tackle the problem of automated defect detection 
for woven fabrics. In that scheme, important texture features of 
the textile fabric are extracted using a pre-trained Gabor wavelet 
network [10]. These methods depend on intensity changes in 
the fabric image and can only capture significant defects such as 
knot, web, and slub. The reduction of wastage, the higher price 
of fabrics with fewer defects, the lower labour requirement, 
and other benefits make investment in an 
automated textile defect inspection system economically 
very attractive. The development of a fully automated web 
inspection system requires robust and efficient defect 
detection and classification algorithms [28] [29]. The 
inspection of real textile defects is particularly challenging 
due to the large number of textile defect classes, which are 
characterized by their vagueness and ambiguity [12]. 
Fabric defects can be ascribed to the fact that no production or 
manufacturing process is 100% defect-free, which applies 
particularly where natural materials, such as textiles, are 
processed [13]. Several reported works [14, 15, 16, 17, 18 and 20] 
discuss the influence of fabric defects on the textile industry. 

In the last two decades, there have been several key 
developments in automated visual inspection techniques for 
fabric defects, and new approaches such as an ultrasonic 
imaging system [19] have also been proposed. 

To deal with the challenges of machine vision based fabric 
inspection systems, numerous attempts have been made 
around the globe to develop techniques to detect and 
classify fabric defects [3-20]. Most of them have 
concentrated on defect detection, whereas few have 
concentrated on defect classification. Mainly three 
defect-detection techniques have been deployed [12], 
namely statistical, spectral, and model-based. A number of 
techniques have been deployed for classification; among 
them, neural networks, support vector machines (SVM), 
clustering, and statistical inference are notable. The most 
common alternative to human visual defect detection, however, 
is the use of a computer vision system to detect differences 
between acquired images [20]. This means that texture 
analysis plays an important role in the automatic visual 
inspection of surface features [17, 21, 22, and 23]. 

It is, however, worthwhile to recall that fabric defects are 
loosely separated into two types [17]: one is global deviation 
of colour (shade); the other is local textural irregularity, 
which is the main concern of our study. Co-occurrence 






matrix technique is based on different grey level 
configurations in a fabric texture [16, 24]. The co-occurrence 
technique can be computationally expensive for 
the demands of a real-time defect inspection system, so the 
number of gray levels is usually reduced in order to keep the 
size of the co-occurrence matrix manageable [34]. The 
process by which these defects are detected is called fabric 
inspection. Visual inspection has worked well 
for many years, in part because the amount of data has been 
small and manageable. Automatic inspection systems are 
designed to increase the accuracy, consistency and speed of 
defect detection in the fabric manufacturing process in order to 
reduce labour costs, improve product quality and increase 
manufacturing efficiency [1-37]. We categorise 
texture analysis into four approaches according to 
the algorithm used: structural, statistical, spectral, and 
model-based. Among these, statistical approaches 
are very popular. Table III summarizes the automated texture 
defect detection and classification techniques reviewed above 
and the various algorithms used in the research. 
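
To make the co-occurrence technique and the grey-level reduction concrete, a minimal NumPy sketch follows. It is an illustration only and does not come from the cited works: the function names, the default of 8 grey levels, the horizontal offset and the choice of the contrast feature are all assumptions made for this example, and the input is assumed to be an 8-bit greyscale image.

```python
import numpy as np


def quantize(image: np.ndarray, levels: int) -> np.ndarray:
    """Reduce an 8-bit grey image to `levels` grey levels (0 .. levels-1)."""
    return (image.astype(np.uint32) * levels // 256).astype(np.intp)


def cooccurrence_matrix(image: np.ndarray, levels: int = 8,
                        dx: int = 1, dy: int = 0) -> np.ndarray:
    """Count pairs of grey levels (i, j) found at pixel offset (dy, dx)."""
    q = quantize(image, levels)
    rows, cols = q.shape
    # Reference pixels and their neighbours at the given non-negative offset.
    ref = q[0:rows - dy, 0:cols - dx]
    nbr = q[dy:rows, dx:cols]
    glcm = np.zeros((levels, levels), dtype=np.int64)
    np.add.at(glcm, (ref.ravel(), nbr.ravel()), 1)
    return glcm


def contrast(glcm: np.ndarray) -> float:
    """Contrast feature: sum of p(i, j) * (i - j)^2 over the normalised matrix."""
    p = glcm / glcm.sum()
    i, j = np.indices(glcm.shape)
    return float(np.sum(p * (i - j) ** 2))
```

The matrix has `levels * levels` entries, so reducing 256 grey levels to 8 shrinks it from 65,536 cells to 64, which is the size reduction referred to above.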

TABLE III. LITERATURE REVIEW OF AUTOMATED TEXTURE DEFECT DETECTION AND CLASSIFICATION 

Approaches               Methods                                 References 

Structural Approaches                                            [25, 17, 16] 

Statistical Approaches   Grey Level Thresholding                 [16, 17, 26, 24, 18] 
                         Cross Correlation                       [16, 17, 40] 
                         Statistical Moment                      [17, 22, 23] 
                         Histogram Equalization                  [25, 22, 6] 
                         Rank-Order Function                     [16, 27] 
                         Fractal Dimensions                      [16, 17] 
                         Edge Detection                          [14] 
                         Morphological Operations                [16, 17, 10] 
                         Eigen Filters                           [16] 
                         Co-occurrence Matrix                    [25, 16, 17, 28] 
                         Artificial Neural Network               [21, 29, 13, 30, 20, 35, 26, 31, 
                                                                 32, 33, 34, 36, 14, 37, 7] 
                         Decision Tree Logic using               [35] 
                         Artificial Neural Network 
                         Optimal Filter Design                   [21] 

Spectral Approaches      Fourier Transforms                      [38] 
                         Gabor Filters                           [10] 
                         Wavelet Transforms                      [24, 39] 

Model Based Approaches   Gauss MRF Model, Poisson's Model,       [24, 40] 
                         Model Based Clustering 



III. RESEARCH APPROACH 

The textile industry plays a major role in everyone's life. Quality 
inspection is an important aspect of all industries in 
producing a quality product. The apparel industry faces many 
problems, such as: 

• Inline inspection in the textile industry 

• Fabric shrinkage 

• Estimation of the porosity of textile fabrics. 

• Measuring of thread densities 

• Unavailability of quality monitoring tools. 

• Inferior quality of raw materials 

• Operations in a critical zone of the garment. 

• Sewing on bias cut. 



• Defect detection and classification. 

Among these, defect detection and classification plays a very 
significant role in producing a quality product on the spinning, 
weaving and processing sides. Many methods and 
algorithms have been developed for this task, but research into 
better methods is still ongoing. 

Machine vision based automated inspection of textile 
defects has been a research topic for a long time [30]. 
Numerous techniques have been developed to detect fabric 
defects. The characterization of real fabric surfaces using 
their structure and primitive set has not yet been successful. 
Therefore, on the basis of the nature of the features extracted 
from the fabric surface, the proposed approaches have been 
characterized into three categories: statistical, spectral and 
model-based. However, random textured images cannot be 
described in terms of primitives and displacement rules, as the 
distribution of gray levels in such images is rather stochastic. 
Therefore, spectral approaches are not suitable for the detection 
of defects in random texture materials [17]. 

The Fourier Transform (FT) has the desirable properties of noise 
immunity and enhancement of periodic features. Chan and 
Pang [40] used Fourier analysis for fabric defect 
detection. Because the Fourier transform analyses the global 
frequency content of the signal, it is not able to localise 
defective regions in the spatial domain [17]. Introducing spatial 
dependency into Fourier analysis through the windowed 
Fourier transform gives the well-known Gabor transform, 
which can achieve optimal localisation in the spatial 
and frequency domains [10]. In [41], Gabor filters are 
designed on the basis of texture features extracted 
optimally from a non-defective fabric image by using a 
Gabor wavelet network (GWN). A major difficulty of this 
method is how to determine the number of Gabor channels at 
the same radial frequency and the size of the Gabor filter 
window in the application [17, 34, and 38]. 

Model-based texture analysis methods are based on the 
construction of an image model that can be used not only to 
describe texture but also to synthesize it. Model-based 
approaches are particularly suitable for fabric images with 
stochastic surface variations [17, 34]. 

An important assumption in the statistical approach is that the 
statistics of defect-free regions are stationary, and that these 
regions extend over a significant portion of the inspection 
images [17, 34]. The co-occurrence matrix technique is based on 
different grey level configurations in a fabric texture [16, 24]. 
This technique can be computationally expensive for the 
demands of a real-time defect inspection system, so the number 
of gray levels is usually reduced in order to keep the size of the 
co-occurrence matrix manageable [34]. Texture properties can 
be extracted by using several bi-dimensional transforms such as 
the Discrete Cosine Transform (DCT), Discrete Sine Transform 
(DST), Discrete Hadamard Transform (DHT), Karhunen-Loeve 
Transform (KLT) and eigenfiltering [17, 34, 30, 38]. A 
backpropagation-based neural network coupled with the 
DCT technique can lead to outstanding results for the 
classification of various fabric defects [11]. The defect 
detection approaches [14, 40] using edge detection are 
suitable for plain weave fabrics imaged at low resolution. 
The difficulty of isolating fabric defects from the noise 
generated by the fabric structure results in a high false alarm 
rate and therefore makes them less attractive for textile 
inspection [34]. Cross-correlation is used for locating 
features in one image that appear in another and therefore 
provides a direct and accurate measure of similarity between 
two images [16, 17 and 38]. Randomly textured backgrounds 
do not correlate well and demonstrate a limitation of this 
approach [17, 36]. Gray level thresholding enables the 
detection of high-contrast defects. Defect detection can be 
effective even when the web is covered by a fine and complex 
pattern [16, 17, 22 and 24]. The fabric inspection system that 
uses thresholding, proposed by Stojanovic et al. [42], gives a 
high detection rate with good localization accuracy and a low 
false alarm rate. Detecting defects morphologically on 
spatially filtered images of fabrics produces better results 
[16, 17, 10 and 38], particularly when the fabric is fine and 
contains defects of small size. Thus the morphological 
operations are only performed on aperiodic image defects, 
unlike the case in [43], where the entire structure of the 
thresholded fabric image was utilized. 

Neural networks are among the fastest and most flexible 
classifiers used for fault detection, due to their non-parametric 
nature and ability to describe complex decision regions 
[17, 34, 38 and 45]. A new approach for the segmentation of 
local textile defects using a feed-forward neural network (FFN), 
and also a new low-cost solution for the task of web inspection 
using a linear neural network, are presented in [35]. Loganathan 
and Girija [46] have used a backpropagation neural network 
with fuzzy logic to classify eight different kinds of 
fabric defects along with defect-free fabric. The real-time 
implementation of a defect segmentation scheme using FFN is 
computationally costly. Although the real-time 
computational complexity of SVM is similar, SVMs do not 
suffer from the problem of local minima and are 
computationally simple to train [17, 45]. Habib [20] 
presented a novel hybrid model integrating a 
genetic algorithm (GA) and a neural network to classify the 
type of garment defects. Experimental results for real fabric 
defect detection show the usefulness of the three intelligent 
techniques, and the authors further stated that the NN has faster 
performance. Online implementation of the algorithms 
showed that they can be easily implemented and may be adapted 
to industrial applications without great effort. Their 
experimental results show that this method can effectively 
detect defects and classify the types of defects with a high 
recognition rate [17, 34, 38 and 45]. 
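
The cross-correlation measure discussed in this section can be sketched in a few lines of NumPy. This is an illustrative assumption of how such a similarity score might be computed, not the method of any cited work: it uses the zero-mean normalised form of the correlation, compares non-overlapping blocks of the image with a defect-free template, and the function names are hypothetical.

```python
import numpy as np


def normalized_cross_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Zero-mean normalised correlation of two equally sized patches, in [-1, 1]."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0.0:
        return 0.0  # a flat patch carries no texture to correlate
    return float((a * b).sum() / denom)


def correlation_map(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Score every non-overlapping block of the image against the template."""
    th, tw = template.shape
    rows, cols = image.shape[0] // th, image.shape[1] // tw
    scores = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            block = image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            scores[r, c] = normalized_cross_correlation(block, template)
    return scores
```

Blocks that repeat the template pattern score close to 1, while blocks whose texture is disturbed score lower. On a randomly textured background every block scores low, which is the limitation noted above.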

The overall objective of this research is to develop an 
automatic defect detection and classification tool based on 
computer vision and an artificial neural network decision tree 
classifier using the resilient backpropagation algorithm. This 
application is a robust alternative to traditional human visual 
inspection. Achieving this goal entails the following 
sub-objectives: 

1. Review the available literature on fabric defect 
categorization and classification, and on the various 
technologies available for defect detection and 
classification. 

2. Deal mainly with two challenges, namely defect detection 
and defect classification. 

3. Generate a fabric defect map with the help of industry 
experts to determine the major, minor and critical defects 
in knitted and woven fabrics that should be considered 
during the pre-processing step. 

4. Acquire and store a sufficiently large database of fabric 
images in TIF/JPEG format, with and without defects and 
at different resolution levels, for the detection and 
classification process. 

5. Develop a suitable procedure using a software package 
(SCILAB or MATLAB) to implement the proposed technique 
(computer vision with an artificial neural network). 

6. Develop a methodology to extract defect features 
from various fabrics using various image processing 
techniques. 

7. Identify and optimize the main parameters of the 
defective images which affect the defect detection 
process. 

8. First train on images of simulated fabric containing the 
chosen major defects, in order to understand the behaviour 
of the frequency spectrum and to determine and optimize 
the most important detection parameters. 

9. Design computer software to identify and 
classify the various defects using the neural network 
decision tree classifier algorithm. 

10. Give a computer demonstration of the sequence of steps 
from pre-processing through to the final detection. 

11. Test and verify the success of the technique using real 
plain fabric samples containing the same defects as the 
simulated ones. 

12. Design and develop a prototype to examine the 
technique in real time (during the production of the 
fabric on the weaving machine), which is the main 
objective of this system. 

13. Efficiently generate a knowledge base for the expert 
system to provide online adaptive capabilities. 

IV. PROPOSED FRAMEWORK 

The proposed methodology focuses mainly on the 
combination of image processing and artificial neural 
networks in the textile industry research arena. 

The main motive of the proposed method is to develop an 
economical automated system for the detection and 
classification of texture defects in fabric for the textile 
industry. The purpose of the developed system is to reduce 
labour cost and time, and to increase productivity as well as 
accuracy in the inspection process. The location, size and image 
of each defect are recorded in the system. After the inspection 
process, the product is graded in terms of severity and a detailed 
report is generated. In addition to the use of standard 
image processing functions for enhancing and modifying the 
digital image, the paper describes artificial neural network 
techniques for classifying knitting and weaving defects. 

A. Proposed Model: 

In this work, a novel approach for defect detection and 
classification based on a neural network decision tree classifier 
is presented. Based on the research, the proposed system 
design is divided into five parts. The first part of the defect 
recognizer focuses on the acquisition of the image; the second 
part involves the processing and normalization of the images to 
detect the faults; in the third part, noise is removed from the 
image using adaptive median filtering and the image is 
converted into a binary image by restoration and thresholding 
techniques; in the fourth part, features are extracted from 
the pre-processed image; and in the fifth part, the extracted 
features are given as input to the neural network decision 
tree classifier for defect detection and classification. 
The whole software component is implemented using 
MATLAB. The block diagram of the proposed framework 
is given in Figure 1. 



[Figure 1 block diagram: Image Acquiring Device (camera etc.) ->
Image Acquisition -> Image Normalization -> Image Segmentation ->
Feature Extraction -> Defect Detection and Classification using
Decision Tree Classifier -> Decision Logic for Defect Detection ->
Desired Output]

Figure 1. Computer Vision and Artificial Neural Network Framework
for Defect Detection and Classification for Texture Images

MATLAB is a high-level language and interactive environment for
numerical computation, visualization, and programming. It is used
to analyse data, develop algorithms, and create models and
applications. HDF, JPEG, PCX, TIFF, BMP and XWD are among the
image formats that MATLAB can read for processing. A brief
overview of the MATLAB simulation process for the proposed method
is shown in Figure 2.

B. Methodology: 

1. Image Acquisition: The digital analysis of two-dimensional
fabric images starts with image acquisition, using a
high-resolution camera and a computer. The most important
parameter in image acquisition is the resolution, which can be
expressed either as the size of one pixel or as the number of
pixels per inch. The lower the image resolution, the less
information is saved; a higher resolution saves more information
but needs more memory for storage. The scanning of fabric images
begins at 300 dpi because human vision is approximately 300 dpi
at maximum contrast. The scanned image is stored in TIFF or JPEG
format, or in grayscale format.

Figure 2. MATLAB Software Simulation Flow



2. Image Normalization: The acquired image may contain noise,
which is the result of errors during image acquisition that
produce pixel values not reflecting the true scene. Noise
reduction, filtering and thresholding remove this noise from the
image using various image pre-processing techniques. An averaging
filter replaces each pixel by the average of the pixels in a
square window surrounding it. A larger window removes noise more
effectively but also blurs details and edges. The filtered image
is converted into a binary image, and the area of the binary
image is then calculated.
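The averaging filter, binarisation and area calculation described in this step can be sketched as follows. This is only an illustration in plain Python (the paper's implementation is in MATLAB); the function names, the 3x3 window, the threshold of 100 and the assumption that defects are darker than the background are all hypothetical choices, not taken from the source.

```python
def mean_filter(image, k=3):
    """Replace each pixel by the average of the k x k window around it.

    Border pixels use only the neighbours that fall inside the image.
    """
    rows, cols = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = [
                image[j][i]
                for j in range(max(0, y - r), min(rows, y + r + 1))
                for i in range(max(0, x - r), min(cols, x + r + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def to_binary(image, threshold):
    """Assumption: pixels darker than the threshold are defect pixels (1)."""
    return [[1 if p < threshold else 0 for p in row] for row in image]


def binary_area(binary):
    """Area of the binary image = number of pixels set to 1."""
    return sum(sum(row) for row in binary)


# Hypothetical 6x6 grayscale patch: bright fabric, a dark 3x3 defect
# and one isolated dark noise pixel in the top-right corner.
image = [[200] * 6 for _ in range(6)]
for y in range(2, 5):
    for x in range(2, 5):
        image[y][x] = 20
image[0][5] = 0

raw_area = binary_area(to_binary(image, 100))
filtered_area = binary_area(to_binary(mean_filter(image), 100))
```

On this patch the filter suppresses the isolated noise pixel, but it also erodes the corners of the defect, which illustrates the trade-off between noise removal and edge blurring mentioned above.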

3. Image Segmentation: Image segmentation is typically used to
locate objects and boundaries (lines, curves, etc.) in images. It
is the process of assigning a label to every pixel in an image
such that pixels with the same label share certain visual
characteristics. Several methods [39] can be used to segment the
defect, i.e., to detect the defect in the image, ranging from
simple segmentation methods (e.g., thresholding) to more advanced
methods that also use background subtraction.
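Assigning a label to every defect pixel can be illustrated with connected-component labelling on an already thresholded image. The sketch below is one simple way to do this in plain Python; the choice of 4-connectivity and the function name are assumptions for illustration, since the source does not specify the labelling method.

```python
from collections import deque


def label_components(binary):
    """Label 4-connected regions of 1-pixels.

    Returns (labels, count): labels has the same shape as the input,
    with 0 for background and 1..count for each connected region.
    """
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for y in range(rows):
        for x in range(cols):
            if binary[y][x] != 1 or labels[y][x] != 0:
                continue
            # Start a new region and flood-fill it breadth-first.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1
                            and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical binary image with three separate defect regions.
binary = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
labels, num_objects = label_components(binary)
```

With 4-connectivity the pixel in the bottom-right corner is a separate region, because it touches its neighbour only diagonally; an 8-connected variant would merge the two.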

4. Feature Extraction: In this process the histogram of the image
is drawn and the feature area is extracted from the binary image.
The binary image is used to calculate the following attributes:

• The area of the faulty portion: the total defective area of
the image.

• Number of objects: image segmentation is used to count the
number of labels in the image.

• Shape factor: distinguishes a circular object from a
non-circular one.

• Height of the overall defect window.

• Width of the overall defect window.

• Ratio of the total defect area to the overall window area.

• Number of defects occurring in the overall defect window.

• Ratio of the smallest defect area to the largest defect area.

These attributes are used as input sets to adapt the neural net
through the training set, so that it can recognize and classify
the expected defects.
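Most of the attributes listed above can be computed directly from a labelled image. The following sketch is a hedged illustration in plain Python, not the authors' MATLAB code: the function name and dictionary keys are hypothetical, the "overall defect window" is assumed to be the bounding box enclosing all defect pixels, and the shape factor is left out because the source does not give its formula.

```python
def defect_features(labels):
    """Compute region features from a labelled image (0 = background)."""
    areas = {}          # label -> pixel count
    ys, xs = [], []     # coordinates of all defect pixels
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab:
                areas[lab] = areas.get(lab, 0) + 1
                ys.append(y)
                xs.append(x)
    if not areas:
        return {"total_area": 0, "num_objects": 0, "window_height": 0,
                "window_width": 0, "area_ratio": 0.0, "size_ratio": 0.0}
    # Assumed definition: the overall window is the bounding box
    # that encloses every defect pixel.
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    total = sum(areas.values())
    return {
        "total_area": total,
        "num_objects": len(areas),
        "window_height": height,
        "window_width": width,
        "area_ratio": total / (height * width),
        "size_ratio": min(areas.values()) / max(areas.values()),
    }


# Hypothetical labelled image with three defect regions.
labelled = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 2, 0],
    [0, 0, 0, 2, 0],
    [0, 0, 0, 0, 3],
]
features = defect_features(labelled)
```

The resulting dictionary has six numeric entries, which is in line with the four to six input units mentioned for the network.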

5. Defect Detection and Classification using ANN Decision Tree
Classifier: A neural network decision tree classifier is designed
to classify the type of defect. The features described under
feature extraction are used as input to the NN classifier.
Initially a simple threshold value is calculated to separate the
critical defects from the major and minor defects. A 3-layer
neural network is designed for this purpose. The ANN contains
four to six computing units in the input layer, twelve to 25
computing units in the hidden layer and six to 10 computing units
in the output layer. Each computing unit in the output layer
corresponds to one defect type, so the defects can be classified
from the attributes obtained during feature extraction. During
the training phase, the training data is fed into the input
layer, propagated to the hidden layer and then to the output
layer. Each node in the output layer receives input from all the
nodes in the hidden layer; these inputs are multiplied by the
appropriate weights and then summed. The target output values are
those that we attempt to teach the network. The error between the
actual output values and the target output values is calculated
and propagated back toward the hidden layer. This is called the
backward pass of the back-propagation algorithm. The error is
used to update the connection strengths between nodes, i.e. the
weight matrices between the input and hidden layers and between
the hidden and output layers. Figure 4 shows the block diagram of
the decision tree classifier and the way in which it identifies
the critical, major and minor defects of knitted and woven
fabrics.
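The forward and backward passes described above can be sketched with a small three-layer network. This is a minimal illustration in plain Python under stated assumptions: the class and method names are hypothetical, the layer sizes (4, 12, 6) are picked from the ranges quoted in the text, a log-sigmoid activation is used in both layers, and the learning rate, weight initialisation and sample values are arbitrary. It is not the authors' MATLAB network.

```python
import math
import random


def logsig(z):
    """Log-sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


class ThreeLayerNet:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias term.
        self.w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                     for _ in range(n_hidden)]
        self.w_ho = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                     for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [logsig(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_ih]
        hb = hidden + [1.0]
        output = [logsig(sum(w * v for w, v in zip(row, hb)))
                  for row in self.w_ho]
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update; returns the error before the update."""
        hidden, output = self.forward(x)
        xb = list(x) + [1.0]
        hb = hidden + [1.0]
        # Backward pass: output-layer error terms first ...
        d_out = [(t - y) * y * (1 - y) for t, y in zip(target, output)]
        # ... then propagate them back to the hidden layer.
        d_hid = [h * (1 - h) * sum(d_out[k] * self.w_ho[k][j]
                                   for k in range(len(d_out)))
                 for j, h in enumerate(hidden)]
        # Update hidden-output and input-hidden weight matrices.
        for k, row in enumerate(self.w_ho):
            for j in range(len(row)):
                row[j] += lr * d_out[k] * hb[j]
        for j, row in enumerate(self.w_ih):
            for i in range(len(row)):
                row[i] += lr * d_hid[j] * xb[i]
        return sum((t - y) ** 2 for t, y in zip(target, output)) / len(target)


# Hypothetical feature vector and one-hot target (defect type 3 of 6).
net = ThreeLayerNet(n_in=4, n_hidden=12, n_out=6)
sample = [0.2, 0.8, 0.5, 0.1]
target = [0, 0, 1, 0, 0, 0]
first_error = net.train_step(sample, target)
for _ in range(200):
    last_error = net.train_step(sample, target)
predicted = net.forward(sample)[1]
```

Repeating the update on a single sample only demonstrates that the error shrinks; a real training run would cycle through the whole training set and check performance on held-out fabric images.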

6. Decision Logic for Defect Classification: A decision tree is
constructed from the output of the neural network classifier to
detect and classify the various defects. Figure 3 shows the steps
required to build an automatic defect detection and
classification system.

Figure 3. Steps required in development of Automatic
Defect Detection and Classification of Texture Fabric Defects

Figure 4. Block diagram of Decision Tree Classifier

The study results show that the proposed method is feasible in
textile production factories for defect detection and
classification and can achieve a success rate of up to 90-95%.
Figure 5 shows the detailed block diagram of defect detection and
classification for texture defects in the textile industry.
Neural networks are among the best classifiers for defect
detection because of their non-parametric nature and their
ability to describe complicated decision regions. In this
proposed work a neural network decision tree classifier is
trained to classify the defect types of textile fabrics.

Figure 5. Block Diagram of Defect Detection and Classification for
Texture Defects in Textile Industry.



The performance of the system can be evaluated using different
fabric images with and without defects.

After the defect detection algorithms and the artificial neural
network classification algorithms have been integrated into the
final system, the proposed approach will be tested on the
experimental platform. The results that can be generated with
this approach are summarized in Table IV.



TABLE IV. RESULTS OF EXPERIMENTS

Parameter                                       Value
Number of Defects                               10-12 defects
Classification Accuracy in Presence of Noise    80-85%
Classification Accuracy without Noise           90-93%
Maximum Inspection Speed                        120 m/min
Width of Testing Fabric                         1 m
Space Inspection Resolution                     1 mm^2
Overall Accuracy for Detection of Defect        90-95%





V. DISCUSSION AND OBSERVATIONS 

The automatic defect detector and classifier captures fabric
images with an acquisition device (digital camera) and passes
them to the computer. First, using image processing techniques,
the defect detector normalizes the image and filters it with
adaptive median filtering. The number of connected components and
the area and bounding box of each region are calculated. Taking
the area value as a threshold, the image is converted into a
binary image. The texture features obtained from the binary image
act as input to the neural network. The input layer consists of
4-6 neurons, the hidden layer of around 25 neurons and the output
layer of 5-10 neurons. The number of neurons in the output layer
represents the number of defect types in the fabric. The neural
network uses the log-sigmoid function as its transfer function
[34]. The mean of the sum of squares of the network weights and
biases is used as the performance function, as mentioned in
equations 1 and 2. The three-layer neural network will be trained
using the following training set:

$$T = \{(X_i, D_i)\}_{i=1}^{N} \qquad (1)$$

where $X_i$ = input vector of the $i$-th example

$D_i$ = desired (target) response of the $i$-th example

$N$ = training set size

Given the training sample $T$, the task is to compute the
parameters of the neural network so that the actual output $y_i$
of the network due to $X_i$ is close enough to $D_i$ for all $i$
in a statistical sense. For example, we may use the Mean-Square
Error (MSE) as the index of performance to be minimized.

$$E(n) = \frac{1}{N} \sum_{i=1}^{N} (D_i - y_i)^2 \qquad (2)$$
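As a small worked example of the mean-square error used as the performance index, the following plain-Python sketch averages the squared differences between target and actual outputs. The function name and the sample values are hypothetical and serve only to illustrate the formula.

```python
def mean_square_error(targets, outputs):
    """Average of squared differences between targets D_i and outputs y_i."""
    if not targets or len(targets) != len(outputs):
        raise ValueError("targets and outputs must be non-empty and equal in length")
    return sum((d - y) ** 2 for d, y in zip(targets, outputs)) / len(targets)


# Hypothetical targets and network outputs for N = 4 examples.
example_targets = [1.0, 0.0, 1.0, 0.0]
example_outputs = [0.9, 0.2, 0.8, 0.1]
example_mse = mean_square_error(example_targets, example_outputs)
```

Here the squared errors are 0.01, 0.04, 0.04 and 0.01, so the mean-square error is 0.10 / 4 = 0.025.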



VI. CONCLUSION 

From the literature reviewed above, it can be concluded that
various defect detection and classification techniques have been
deployed to find defects in various fabrics. At first, different
computer vision techniques were employed on their own to find the
defects. This was not helpful: detection accuracy was poor,
computation time increased and the results were not accurate.
Later, various artificial intelligence methods were combined with
image processing techniques to overcome these shortcomings. Some
of the techniques are KNN, Bayesian, SVM, radial basis function,
Gabor wavelet, feed-forward neural network and PNN. The core
ideas of these methodologies, along with their drawbacks and
criticisms, were discussed. In order to identify the formation
and nature of the defects, it is important to localise the
defective regions accurately rather than classifying the surface
as a whole. This helps to classify the defects and supports
further studies.

The proposed work can be used to detect and classify defects in
small-scale industries. In this research we have used image
processing techniques with a neural network decision tree
classifier to identify several defects such as hole, scratch,
missing yarn and slub. A success rate of up to 90%-94% can be
achieved in identifying the defects. To obtain better results,
the statistical and spectral approaches can be combined.

Abstract- Autism is one of the most prevalent chronic diseases;
it is a developmental disability that causes problems with
communication and social interaction, combined with repetitive
behaviour. Some behaviors may occur before age three and appear
as delays in diverse skills that develop from childhood to
adulthood, although behaviors may improve over time. The disorder
includes a large spectrum of symptoms, levels of impairment, and
skills. It varies in severity from a handicap that somewhat
limits an otherwise normal life to a devastating disability that
may need institutional care. However, optimization of health care
is likely to have a good effect on the progress of the disease,
quality of life, and functional outcome. This paper designs and
builds a reliable system based on electronic healthcare
technologies to care for autistic children. It uses a centralized
database that contains the patients' profiles; the system enables
the nurse to access the database and enter the details of
patients in order to subject them to a set of tests. These data
are then held in the database, and the system identifies the type
of autism to determine the suitable treatment for each type. The
proposed clinical system will serve patients and health care
providers by reducing healthcare costs and enabling parents to
help their autistic children in the best way.

Keywords- ASDs, Autistic children, Neurobehavioral, Clinical 
system, Ehealth care systems, Pediatricians. 

1 Introduction 

Autism is a spectrum of closely related disorders with a common
core of symptoms. It appears in early childhood, causing delays
in several important areas of development such as learning to
speak, move, and interact with others. Most autistic children
have a learning disability, also known as mental retardation.
Many theories propose that autism may originate from a genetic
susceptibility to environmental factors [1]. Children who have
autism may look normal, but their behavior can be downright
difficult. The signs and symptoms of autism vary widely between
individuals, as do its effects. Some children have only mild
impairments, while others have more obstacles to overcome [2].
Generally, every child with autism has problems, at least to some
degree, in these three areas: relating to others and the world
around them, communicating verbally and non-verbally, and
behaving and thinking flexibly [3].

Parents, physicians, and experts in the autism spectrum hold
different opinions about what causes this disease and how best to
treat it, and the answers are still unknown. Fortunately, recent
advances in technology have opened the door for scientists to
tackle the issue. This paper introduces the autism spectrum, its
types, symptoms and causes, and how best to treat it by using a
clinical system that diagnoses and treats children with ASDs.
This system will help to manage the disease and to support those
children who need careful attention from their families or their
carers.

1.1 Autism Understanding

Autism is a neurodevelopmental disorder that typically involves
delays and impairment in language development, behavior, and
social skills. It refers to people showing dissimilar behaviors
along a spectrum, and describes qualitative differences and
weaknesses in reciprocal social interaction, combined with
repetitive behaviours. According to the criteria defined in the
International Statistical Classification of Diseases, autism
spectrum disorders are diagnosed in children, young people and
adults [4]. It is a lifelong disorder that has an important
impact on the child or young person and on their parents or
carers. Patients with autism may feel a deep sense of relief that
others agree with their concerns and observations [5]. Diagnosis
and the evaluation of needs can help in understanding the
obstacles that patients face, and can open routes to support and
services in social care, health services and voluntary
organisations, and to contact with other children and parents who
have similar experiences; all of these can improve the lives of
the children and their carers. Because of these children's
particular social problems, many tools are now available to
evaluate social skills, such as eHealth care systems; to ensure
an efficient system that serves children with ASD, we need to
determine what each individual needs.

1.2 Autism Types and Symptoms

Autism ranges in severity from a handicap that impedes normal
life to a devastating disability that may require rapid
intervention and special care. Children with autism have trouble
communicating and are very sensitive; they find it difficult to
understand what others think and feel. They may also be quickly
affected by influences that surround them, such as touches,
smells and sounds, and they sometimes find it difficult to
express pain [6]. Different types of ASDs have been defined by
the guidelines in the diagnostic manual (DSM-IV) of the American
Psychiatric Association. According to the CDC, the three major
types of Autism Spectrum Disorders are Asperger's syndrome,
pervasive developmental disorder not otherwise specified
(PDD-NOS), and autistic disorder [7].

Children with autism have trouble picking up on subtle nonverbal
cues and using body language; this makes it very hard for them to
express themselves through facial expressions and touch, and
makes reciprocal social interaction very difficult [8]. Early
signs and symptoms of autism in toddlers include the following:

They don't make eye contact normally.

They don't respond to the sound of a familiar voice.

They don't use gestures to communicate.

They don't make noises to get your attention.

They don't track objects visually.

They don't respond to cuddling.

They don't imitate your facial expressions or movements.

They don't interact or play with other people.

1.3 Autism Causes 

Autism is defined as a complex developmental disorder; experts
believe that autism appears in the first three years of a child's
life as a result of a neurological disorder that affects normal
brain function and social interaction skills [9]. Genomic
research has found that children with ASDs probably share genetic
traits with individuals with attention disorder, schizophrenia,
or clinical depression [10]. However, ASDs have no single known
cause, and many causes probably play a role, including genetic
problems that result in a malfunction in the brain, and
environmental factors such as problems during pregnancy, viral
infections, and others [11].



1.4 Autism Treatment 

The main goals of treatment are to enhance the quality of the
child's life by minimizing the core features of ASDs, promoting
socialization, reducing maladaptive behaviors, and guiding and
supporting parents. It is therefore essential to find helpful
services, education, and treatments for autistic children.
Physicians have a key role in the early recognition and
evaluation of autism disorders, and also in their chronic
management. Treatment and care should take into account the needs
of the children and of the families and carers who look after
them [12]. Families and carers should work together with health
care professionals for the treatment of their children to
succeed. There are many treatments that can help autistic
children to learn new skills and overcome their disability.
According to the National Institutes of Health, treatments for
autism disorders can involve behavioral management therapy,
educational therapies, cognitive behavior therapy, and medication
treatment. Other treatments and therapies that have been used for
autistic children include speech and language therapy, the
Picture Exchange Communication System, music therapy, vitamin and
mineral supplements, and massage therapy [13].

In the proposed system, the treatments for autistic children
include chemo therapy and physical therapy, as shown in the
system interfaces section.

The aim of the proposed system is to serve health care providers
by carrying out a set of tests to identify the type of autism,
and to make a suitable decision based on the collected data to
identify the required treatment for each autism type. The system
is used to manage and monitor autism disorders in autistic
children; it is efficient because it reduces health costs and
enhances communication between patients and health care
providers.

2.1 Brief System Description 

The scope of this work is to build a reliable system based on
electronic healthcare technologies; the clinical database of the
management system of children with autism stores the details of
the users of the system. Four users were identified: doctor,
nurse, admin, and technician; each one has a collection of
functions inside the system. The nurse is able to register, log
in, log out, add and edit patient accounts, search patient
profiles, and test the patient's state. The doctor is able to
register, log in, log out, search patient profiles, view the test
result and the treatment, print the medical prescription for the
treatment, and follow up the patient's case for the duration of
the treatment. The admin is able to register, log in, log out,
add and edit user accounts, and manage the system. Lastly, the
technician is able to register, log in, log out, check the system
hardware and back up the system.



[Figure 1 image: architecture diagram; recoverable labels:
"Treatment Center", "Patient", "Test the patient"]

Figure 1: System Architecture for Management System of Children
with Autism

2.2 MoSCoW Prioritisation 

The MoSCoW prioritisation technique helps to produce a strong
system [14]. It is applied to the Management System of Children
with Autism. First, several interviews are held with health care
providers such as doctors and nurses, and with the families of
patients, to identify the functional requirements of the clinical
system. These requirements are then classified by priority as
follows:

• Input staff information (Should).

• Add new patient account (Must).

• Test patient state (Must).

• Determine the type of disease (Must).

• Determine the treatment for each type of disease (Must).

• Print the medical prescription of treatment (Should).

• List all appointments for the coming week (Should).

• List all staff (Should).

• Report on historical appointments (Could).
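The prioritised requirements above can be kept as simple data and grouped by MoSCoW category, for example to plan which features to build first. The sketch below is a hypothetical illustration in Python (the system itself was built with Visual Studio and SQL Server); the variable and function names are assumptions, while the requirement texts and priorities are those listed above.

```python
from collections import defaultdict

# MoSCoW categories in descending priority.
PRIORITY_ORDER = ["Must", "Should", "Could", "Won't"]

REQUIREMENTS = [
    ("Input staff information", "Should"),
    ("Add new patient account", "Must"),
    ("Test patient state", "Must"),
    ("Determine the type of disease", "Must"),
    ("Determine the treatment for each type of disease", "Must"),
    ("Print the medical prescription of treatment", "Should"),
    ("List all appointments for the coming week", "Should"),
    ("List all staff", "Should"),
    ("Report on historical appointments", "Could"),
]


def group_by_priority(requirements):
    """Return {priority: [requirement, ...]} ordered from Must to Won't."""
    groups = defaultdict(list)
    for name, priority in requirements:
        if priority not in PRIORITY_ORDER:
            raise ValueError(f"unknown MoSCoW priority: {priority}")
        groups[priority].append(name)
    # Keep only categories that actually occur, in priority order.
    return {p: groups[p] for p in PRIORITY_ORDER if groups[p]}


grouped = group_by_priority(REQUIREMENTS)
```

Grouping in this way makes it easy to confirm that all "Must" items are covered before any "Should" or "Could" item is scheduled.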

2.3 Data Flow Diagram 

The data flow diagram (DFD) is a tool that shows the flow of data
through a system and the work and processing performed by that
system. It is used to help understand the existing system and to
represent the required system. The diagram in Figure 2 shows the
external entities that send and receive information [15], [16].



[Figure 2 image: data flow diagram; recoverable labels: "Nurse",
"Admin"]

Figure 2: Data flow diagram of the system

2.4 User Involvement 

During the development of this system, it is essential to gather
information about the users related to the system [17]. In the
Management System of Children with Autism four users were
identified: doctor, nurse, admin, and technician. Each of them
has several responsibilities and roles, as shown in Table 1.

Table 1: Roles of users

Actors           Roles
Nurse            - Register
                 - Log in and log out
                 - Add patient account
                 - Edit patient account
                 - Search patient account
                 - Test patient state
                 - Determine the type of disease
Doctor           - Register
                 - Log in and log out
                 - Search patient account
                 - View the result of test and the treatment
                 - Determine the treatment for each type of disease
                 - Print the medical prescription of treatment
                 - Follow up the patient case during duration of
                   treatment
Administrator    - Register
                 - Log in and log out
                 - Add, edit user account
                 - Manage the system
Technician       - Back up the system
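The roles in Table 1 map naturally onto a role-to-permission lookup. The following Python sketch is a hypothetical illustration of such a check, not the system's actual code; the action identifiers and the function name are invented for the example, and the permission sets follow Table 1 only.

```python
# Hypothetical action identifiers derived from the roles in Table 1.
ROLE_PERMISSIONS = {
    "nurse": {
        "register", "log_in", "log_out",
        "add_patient_account", "edit_patient_account",
        "search_patient_account", "test_patient_state",
        "determine_disease_type",
    },
    "doctor": {
        "register", "log_in", "log_out",
        "search_patient_account", "view_test_result_and_treatment",
        "determine_treatment", "print_prescription",
        "follow_up_patient_case",
    },
    "administrator": {
        "register", "log_in", "log_out",
        "add_user_account", "edit_user_account", "manage_system",
    },
    "technician": {"back_up_system"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

For example, `is_allowed("nurse", "test_patient_state")` is true, while `is_allowed("doctor", "add_patient_account")` is false because Table 1 assigns that task to the nurse only.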



3 Proposed System Design 

The purpose of the design phase is to investigate what the
software will look like, both graphically and functionally, and
to create a technical solution that satisfies the functional
requirements of the system [18]. In this phase, we design the
clinical database of the management system of children with
autism. This is considered a very important phase: a
well-designed database provides the correct information for the
medical decision-making process to succeed efficiently. However,
the term database design can be used to describe many different
parts of the design of an overall database system.

3.1 Use Cases Models 

A use case model illustrates a set of possible scenarios related
to a particular goal. Each use case represents an action that the
system must allow an associated user to perform. The model
defines the boundary of the system and the relationship between
the system and its environment [19]. In the Management System of
Children with Autism, there are four users in the use case
diagram. Each user has many functions in the system, as shown in
Figure 3.

[Figure 3 image: use case diagram; recoverable label: "Manage the
System"]

Figure 3: Use Cases Model

3.2 Implementation phases

In the clinical database design of the Management System of
Children with Autism, three main phases are carried out:
conceptual, logical and physical design.

The conceptual design describes the relations and connectivity
between all components of the system. The logical data model
consists of specified classes that become tables (six tables, as
shown in Table 2); the attributes become fields, and the
associations become relationships. Lastly, the physical design
translates the logical database into a physical database [20].



Table 2: Contents of Management System of Children with Autism

No.   Table Name         Description
1.    Patient            Information related to the patient
2.    Disease            Information about the disease and its types
3.    Symptoms           Information about the symptoms of the disease
4.    Treatment          Information about the treatment of the disease
5.    Physical Therapy   Information about physical therapy of the
                         disease
6.    Chemo Therapy      Information about chemo therapy of the
                         disease

3.3 Used Tools

In the Management System of Children with Autism, Microsoft
Visual Studio 2010 and Microsoft SQL Server 2008 were used as an
easy yet powerful way to create a management system that
interacts with its users. Microsoft Visual Studio 2010 was used
to design the interfaces of the system and to implement the
program code, while Microsoft SQL Server 2008 was used to create
the database, its tables and data, and to connect it to the
system.
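The six tables of the clinical database can be sketched as a small schema. The example below is a hypothetical illustration that uses Python's built-in `sqlite3` module instead of Microsoft SQL Server 2008, so that it runs without a server. The patient and drug column names follow those visible in the interface screenshots; the remaining columns, the key relationships and the function name are assumptions and do not reproduce the authors' actual schema.

```python
import sqlite3

# Assumed schema: one table per entry in Table 2, with illustrative keys.
SCHEMA = """
CREATE TABLE disease (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT
);
CREATE TABLE patient (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    age INTEGER,
    gender TEXT,
    email TEXT,
    mobile TEXT,
    address TEXT,
    state TEXT,
    disease_id INTEGER REFERENCES disease(id)
);
CREATE TABLE symptoms (
    id INTEGER PRIMARY KEY,
    disease_id INTEGER NOT NULL REFERENCES disease(id),
    description TEXT NOT NULL
);
CREATE TABLE treatment (
    id INTEGER PRIMARY KEY,
    disease_id INTEGER NOT NULL REFERENCES disease(id),
    description TEXT
);
CREATE TABLE physical_therapy (
    id INTEGER PRIMARY KEY,
    treatment_id INTEGER NOT NULL REFERENCES treatment(id),
    ph_name TEXT NOT NULL
);
CREATE TABLE chemo_therapy (
    id INTEGER PRIMARY KEY,
    treatment_id INTEGER NOT NULL REFERENCES treatment(id),
    drug_name TEXT NOT NULL,
    drug_amount INTEGER,
    notes TEXT
);
"""


def create_database(path=":memory:"):
    """Create the six tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Calling `create_database()` with no argument builds the schema in memory, which is convenient for trying out queries before committing to a physical design.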

4 System Interfaces and Results 

The interfaces and results obtained through the implementation of
the Management System of Children with Autism are shown in this
section:

1. Login Interface: The first step is for the admin or nurse to
log in to the system by entering their own username and password,
as shown in Figure 4. The page shown in Figure 5 then appears; it
contains the patient database and offers two options: add a new
patient, and edit patient data.




Figure 4: Login interface

[Figure 5 screenshot: "Patient Management" grid with columns id,
name, age, gender, email, mobile, address, state, and the buttons
"Add a new patient" and "Edit patient data"]

Figure 5: Database of the patient



2. Figure 6 shows that the process of adding a new patient
account completed successfully, while Figure 7 shows that the
update process also completed successfully.

Figure 6: Added successfully

Figure 7: Updated successfully

3. Tests interface: When the nurse selects the tests option on
the main page, the tests page appears, as shown in Figure 8, to
identify the type of autism. Figure 9 shows the interface with
the test results and the identified type of disease, and Figure
10 shows the details of each type of autism.





[Screenshots for Figures 8 to 10: test results grid (visible
columns include id, age, date and state) and the fields "Disease
Name" and "Disease Symptoms"]

4. Figure 11 shows the treatment interface for each type of
disease, and Figure 12 illustrates the chemo therapy treatment.
Lastly, Figure 13 illustrates the medical prescription of the
treatment, which is printed for the patients' families.



[Figure 11 screenshot: fields "Disease Name" and "Physical
Therapy"]

Figure 11: Treatment for each type of disease

[Figure 12 screenshot: chemo therapy form with fields "Drug name"
and "Drug amount", and a grid with columns id, drug_name,
drug_amount, Addjerat, notes]

Figure 12: Treatment of Chemo Therapy



Patient name and His Diss 






ase Type 




Patient_name 




Disease_name 


v^f 1 ali hassn 




Asperger's Syndrome 





Dr. Ade 



m 



PhysicalTherapy 




ChemoTherapy 




Ph_name 






drug_name 


drug_amount 


Addjerat 


notes 


► 


L Hiring a behavioral therapist to assist with your Aspie child at home 






Prasetol 


3 


10 AM P 4 PM r UO PM 


None 




2. Social skills therapy 






Panadol 


2 


8 AM j 2 PM 


None 




3. Healthy r well-balanced diet 






FlowOut 


3 


10 AM j. 2 PM j. 8 PM 


None 




4. Natural herbal remedies [as part of the overall treatment strategy 






Algisic 


3 


10 AM , 4 PM , 10 PM 


None 




5. Parent education and training 






Samalin 


2 


10 AM t 6 PM 


None 












Figure 13 : Medical prescription 
5 Conclusion and Future work

This section presents the conclusion and future work for the
Management System of Children with Autism.

5.1 Conclusion 

Good communication between healthcare providers and
autistic children is essential; it should be supported by
evidence-based written information customized to the needs of
the child and their parents or carers. This paper has
presented the design and implementation of a reliable system
based on e-health care technologies for autism spectrum
management care. The aim of this work was to describe the
basic building blocks of the autism management system. The
system provides autistic children with reliable ways to manage
their condition even outside the doctor's clinic. It produces
a set of tests to identify the type of autism, and then allows
the health care provider to make an informed healthcare
decision, based on the collected data, to determine the
required treatment for each type. Lastly, the treatment
recommendations are given to the child's parents as a printed
medical prescription.

Once implemented, the system is capable of achieving the
following:

• It enhances the services provided to patients by
making their records available to the doctor, who can
follow up the case easily and with less effort;

• Autism Spectrum management executes the required
software functions dependably and consistently;

• It provides the best control of the patients' status based
on their test results.

Lastly, we can say that when autistic children find the
assistance to meet their special needs, together with the
correct treatment plan and a lot of support, they will learn
and improve.



5.2 Future work 

This section discusses future improvements to the
Management System of Children with Autism through the
following additional functionalities:

• Deploying the system on the web;

• Enabling the patient to make an appointment with the
doctor before going to the health centre;

• Allowing the families and carers of autistic children
to access the database to enter the patients'
requirements, and enabling them to receive the latest
developments and recommendations so that they can follow
the medical advice given by practitioners.
Abstract- Requirements traceability is an essential step in
ensuring the quality of software during the early stages of its
development life cycle. Requirements tracing usually consists of
document parsing, candidate link generation and evaluation,
and traceability analysis. This paper demonstrates the
applicability of Statistical Term Extraction metrics to
generating candidate links. The method is applied and
validated using two datasets and four filters, two for each
dataset: 0.2 and 0.25 for MODIS, and 0 and 0.05 for CM1.
It generates requirements traceability matrices between
textual requirements artifacts (such as high-level
requirements traced to low-level requirements). The proposed
method includes ten word frequency metrics, divided into
three main groups, for calculating the frequency of terms.
The results show that the proposed method gives better
recall than the traditional TF-IDF method.

Keywords- Requirements Traceability; Traceability Analysis; 
Candidate Link Generation; Parsing; Term Extraction; Word 
Frequency Metrics. 

I. Introduction 

The traceability of requirements was introduced mainly to
manage and document the life of requirements. Its major
objective is to support critical software development
activities, for instance, the assessment of whether a
software system has satisfied its defined set of requirements,
the verification that all requirements have been implemented
by the end of the lifecycle, and the analysis of the impact
of proposed changes on the system [1].

It is usually essential to follow the changes of 
requirements all the way through the lifecycle of software. 
All requirements should be validated in and at the end of 
each phase of the lifecycle. Traceability matrices are usually 
constructed to show the satisfaction of requirements by the 
design [2]. 

Generating traceability links (or traceability matrices) is 
fundamental to many software engineering activities [3]. But 
it is a time consuming, error prone, and mundane process. 
Most frequently, traceability matrices are built manually. 
When an analyst tries to trace a high level requirement 
document to a lower level requirement specification, he may 
have to look through M x N elements, where M and N are the 
number of high and low level requirements, respectively. 
Moreover, very few tools are available to assist analysts
in tracing unstructured textual artifacts, and those that
exist require enormous pre-processing [2].

Verification and Validation (V&V) and Independent 
Verification and Validation (IV&V) are used to ensure that 
the right processes have been used to build the right system. 
That is why it must be verified that the agreed processes and 
artifacts are directing the development in each phase of the
life-cycle, in addition to ensuring that all requirements have 
been implemented at the end of the lifecycle. A requirements 
traceability matrix (RTM) is necessary for both of these 
[4][5]. 

The automatic generation of traceability links requires 
Information Retrieval (IR) techniques to reduce the time 
needed to generate the traceability mapping [3]. 

Requirements tracing usually comprises document parsing,
candidate link generation, candidate link evaluation, and
traceability analysis. There are two commonly used measures
for evaluating candidate link lists: recall and precision. In
candidate link evaluation, the analyst investigates the
candidate links and determines those that are actual (true
links), and those that are not (false positives, bad links). To
achieve this, the analyst visually inspects the text of the
requirements to find out their meanings, compares them, and
decides, based on his or her own judgment, which meanings
are adequately close. This decision relies on human judgment
and carries all the advantages and disadvantages that come
with it [4] [5].

When tracing is finished, reports are generated by the 
analyst stating the high level requirements that do not have 
children and the low level elements that do not have parents 
(traceability analysis) [4] [5]. 
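
The traceability analysis step described above can be illustrated with a small sketch. It assumes that accepted links are stored as (high-level id, low-level id) pairs; the function name and the identifiers are hypothetical and are not taken from any tool discussed in this paper.

```python
def traceability_gaps(high_ids, low_ids, links):
    """Report high-level requirements without children and
    low-level elements without parents.

    links is an iterable of (high_id, low_id) pairs accepted by the analyst.
    """
    links = list(links)
    highs_with_children = {h for h, _ in links}
    lows_with_parents = {l for _, l in links}
    childless = [h for h in high_ids if h not in highs_with_children]
    orphans = [l for l in low_ids if l not in lows_with_parents]
    return childless, orphans


# Hypothetical identifiers, for illustration only.
high = ["H1", "H2", "H3"]
low = ["L1", "L2", "L3", "L4"]
accepted = [("H1", "L1"), ("H1", "L2"), ("H3", "L4")]
childless, orphans = traceability_gaps(high, low, accepted)
print(childless, orphans)
```

Here H2 has no children and L3 has no parent, so both would appear in the analyst's report.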

II. Related Work 

Many researchers have presented their work in 
requirement tracing during the last few years, such as: 

In 2004 Hayes, et al. [5] designed RETRO to support the 
IV&V analyst in requirements tracing to find and evaluate 
candidate links. 

Also in 2004, Sundaram, Hayes, and Dekhtyar [6] studied
a mixture of IR methods used to solve the requirement
traceability problem. They found that existing IR methods
can be used to automate the generation of candidate links
with minimal modification, and that the analyst's feedback
can considerably improve requirements tracing.

In 2006, Hayes, Dekhtyar, and Sundaram [4] examined
the effectiveness of information retrieval methods in
automating the tracing of textual requirements. They found
that feedback from the analyst improves the final results as
judged by objective measures.

In 2007, Sundaram [2] assisted analysts in the traceability 
links generation process with information retrieval 
techniques for improving the quality of the generated links in 
addition to time saving. 

Finally, in 2010, Sundaram, et al. [3] stated that
Information Retrieval techniques have been shown to aid in
the automated generation of links by reducing the time
needed to generate the traceability mapping.
Researchers have successfully used techniques such as Latent
Semantic Indexing (LSI), Vector Space Retrieval, and
Probabilistic IR.



III. Requirement Tracing 

Requirements tracing is defined as the ability to describe 
and follow the life of a requirement, in both a forward and a 
backward direction, through the whole systems life cycle [2]. 

During the process of requirement gathering, the analyst
has to clarify customer needs, conduct feasibility studies,
specify a solution, and cross-validate the specifications [7].

In large-scale projects, it is quite possible to miss or
misinterpret some of the recognized requirements. More than
80% of the failures in large-scale mission-critical projects are
caused by undetected problems in the early phases of the
software development lifecycle [8]. An unobserved problem
at the start of the project can continue all the way through to
the deployed product, becoming a latent defect or latent
error [7].

Two sets of documents are typically created in the early 
phases of any software project: 

• Software Requirements Specification SRS 

It is defined as "documentation of the essential
requirements (i.e., functions, performance, design
constraints, and attributes) of the software and its
external interfaces." The software requirements are
derived from the system specification [7]. The SRS is a
"binding contract among designers, programmers,
customers, and testers"; it includes different design
views or paradigms for system design [9].

• Software Design Description SDD 

The design activity is used to identify the components of 
the software design and their interfaces from the 
Software Requirements Specification. The principal 
artifact of this activity is the Software Design 
Description (SDD) [9]. It is a "representation of software 
created to facilitate analysis, planning, implementation, 
and decision making". It is used as a medium for 
communicating software design information, and may be 
viewed as a blueprint or model of the system [7]. 

At the end of a requirements tracing process, a 
requirements traceability matrix (RTM) is generated [2]. 
RTM acts as a tool for indicating the way that the design and 
implementation elements deal with requirements throughout 
the whole software development lifecycle [7]. 
IV. Information Retrieval (IR) for Requirements Tracing

Information retrieval (IR) is the process of discovering
documents relevant to an information request, usually a
search query, in a collection of documents [7].

The main issue in IR is the determination of relevant
documents in document collections given user-specified
information needs. Most IR methods work by converting
each document in the collection into a mathematical
representation that captures the information content of the
document; a comparison is then conducted with similar
representations of user information needs (queries). Nearly
all IR methods are keyword-based: the document and query
representations contain information regarding the importance
of particular keywords found in the document [10]. There is a
broad array of keyword-based retrieval models meant for
document collections. The Boolean model is the simplest: a
representation of a document is a Boolean vector identifying
the keywords found in the document. A Vector model
broadens the Boolean model by associating each term in the
document representation with a weight that signifies its
perceived importance to the document collection [11].

Documents and queries are represented as vectors of
keyword weights. Formally, let $V = \{k_1, \ldots, k_N\}$ be the
vocabulary of a given document collection. Then, a vector
model of a document d is a vector $(w_1, \ldots, w_N)$ of keyword
weights, where $w_i$ is computed as in Eq. (1) [10] [11].

$$w_i = tf_i(d) \cdot idf_i \qquad (1)$$

Where

$tf_i(d)$ is the term frequency of the ith keyword in
document d,

$idf_i$ is the inverse document frequency of the ith term in
the document collection.

Term frequency is the number of term occurrences in the
document and is usually normalized. The inverse document
frequency is computed using Eq. (2) [10][11].

$$idf_i = \log_2\left(\frac{n}{df_i}\right) \qquad (2)$$

Where

$df_i$ is the total number of documents containing the ith
term in the document collection, and

n is the size of the document collection.

The significance of a term is judged by how often the term
occurs in the document and by how discriminating the term
is. That is, terms that occur in fewer documents carry more
weight for the documents in which they appear. A user query
is also converted into a similar vector $q = (q_1, \ldots, q_N)$ of
term weights. In this model, given a document vector d and a
query vector q, the similarity between them is computed as
the cosine of the angle between vectors d and q in the
N-dimensional space, as in Eq. (3) [10][11].

$$sim(d, q) = \cos(d, q) = \frac{\sum_{i=1}^{N} w_i \cdot q_i}{\sqrt{\sum_{i=1}^{N} w_i^2} \cdot \sqrt{\sum_{i=1}^{N} q_i^2}} \qquad (3)$$



V. Employed Filters

In this work, four filters are introduced to generate
candidate link lists whose relevance is higher than one of the
predefined levels: 0, 0.05, 0.2, and 0.25. This filtering acts
as an assessment of the quality of the candidate link list.
Given two candidate link lists, say list X and list Y, with the
same recall and precision, if the true links show up nearer
the top of list X than of list Y, then list X is preferable to
list Y from the analyst's standpoint [2].
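
As a rough illustration of how the vector space model of Eqs. (1) to (3) and a relevance filter fit together, the following Python sketch ranks candidate links between tokenized high-level and low-level requirements. The function names, the tokenized input format and the length-normalized term frequency are assumptions of this sketch; it is not the implementation used in this work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one TF-IDF weight vector per document (Eqs. 1 and 2).

    docs is a list of token lists; the vocabulary is shared by all documents.
    """
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        length = len(d) or 1  # assumption: tf is normalized by document length
        vectors.append([(counts[t] / length) * math.log2(n / df[t]) for t in vocab])
    return vectors


def cosine(d, q):
    """Cosine similarity between two weight vectors (Eq. 3)."""
    dot = sum(w * x for w, x in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(x * x for x in q))
    return dot / norm if norm else 0.0


def candidate_links(high, low, threshold):
    """Return (high_index, low_index, score) with score > threshold, best first."""
    vecs = tfidf_vectors(high + low)
    hv, lv = vecs[:len(high)], vecs[len(high):]
    links = [(i, j, cosine(h, l)) for i, h in enumerate(hv) for j, l in enumerate(lv)]
    return sorted((x for x in links if x[2] > threshold), key=lambda x: -x[2])


# Toy requirements, invented for illustration.
high = [["store", "image", "data"], ["report", "error"]]
low = [["image", "data", "buffer"], ["log", "error", "code"], ["power", "supply"]]
for i, j, score in candidate_links(high, low, 0.0):
    print(i, j, round(score, 3))
```

Raising the threshold shortens the list: with this toy data a filter of 0.2 keeps only the strongest pair, which mirrors the trade-off between recall and precision that the filters are meant to expose.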
VI. Measuring the Efficiency

To evaluate the efficiency of IR techniques, recall (R)
and precision (P) are used as the primary measures. Recall
measures whether a method succeeded in finding all the
high-low level requirement pairs that trace to each other,
while precision reflects the number of additional pairs found
by the method that do not trace to each other [6].

Recall is computed by dividing the total number of
relevant retrieved documents by the total number of
relevant documents in the complete collection, as in
Eq.(6) [12].

$$R = \frac{\#\text{ of relevant retrieved}}{\#\text{ of relevant in collection}} \qquad (6)$$

Precision is calculated as the total number of relevant
retrieved documents divided by the total number of
retrieved documents, as shown by Eq.(7) [12].

$$P = \frac{\#\text{ of relevant retrieved}}{\#\text{ of retrieved}} \qquad (7)$$

In the metric equations of Section VIII, the following
notation is used: $tf_{ij}$ is the frequency of term i in
document j, N is the size of the corpus, and $w_i$ is the
weight of term i.
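
Recall and precision as defined in Eqs. (6) and (7) can be computed directly from a candidate link list and an answer set. The sketch below assumes that links are represented as (high-level id, low-level id) pairs; the function name and the sample identifiers are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a candidate link list.

    retrieved: links proposed by the method; relevant: links in the answer set.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant links that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Invented example: 4 candidate links, 3 true links, 2 of them found.
candidates = [("H1", "L1"), ("H1", "L2"), ("H2", "L3"), ("H3", "L1")]
answer_set = [("H1", "L1"), ("H2", "L3"), ("H2", "L4")]
r, p = recall_precision(candidates, answer_set)
print(f"recall={r:.3f} precision={p:.3f}")
```

In this example two of the three true links are retrieved, giving a recall of 2/3, and two of the four candidates are true links, giving a precision of 1/2.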



VII. Term Extraction 

Term extraction is an important issue in natural
language processing; its goal is to extract sets of words with
precise meaning from a collection of text. A number of
linguists consider these terms to be the base semantic unit
of language. Applications of automatic term extraction
include machine translation, automatic indexing, building
lexical knowledge bases, and information retrieval [13].

Both supervised and unsupervised techniques have been
used in earlier investigations to extract and distinguish
terms. Nearly all studies aimed at locating the most
significant set of terms from a domain corpus, that is,
the set of surface representations of domain concepts that
best symbolize the domain for a human expert [14].

Term frequency in a corpus is a basic statistical property. 
This may then be compared to the frequency of the term in 
other corpora, such as balanced corpora or corpora from 
other domains. Basic frequency counts are integrated to 
compute co-occurrence measures for words. Co-occurrence 
measures are employed to estimate the propensity for words 
to appear together as multi-word units in documents, and to 
estimate the likelihood that units on either side of a bilingual 
corpus correspond under translation [15]. 

Term extraction can be used in this work to solve two 
issues: 

• Finding high and low level requirements to create a 
common vocabulary. This is carried out using Statistical 
approaches, where all the terms are placed in a common 
vocabulary without any repetition. 

• Using Statistical Term Extraction Metrics to calculate term 
weighting instead of TF-IDF in information retrieval. 



VIII. Statistical Term Metrics

In this work, ten standard metrics are proposed, each as
a measure to be used instead of the one used in the TF-IDF
method, which was given in Eq.(1). These metrics are divided
into three main groups, as explained in the following
subsections [16].



A. Term Frequency Based

The majority of term extraction algorithms base their
results on some computation concerning term frequency.

1) Corpus Term Frequency

This metric is purely a term frequency metric, calculated
over the entire corpus. It focuses on words that appear
more often, and consequently favors large documents.
Eq.(8) shows this calculation [16].

$$w_i = \sum_{j=1}^{N} tf_{ij} \qquad (8)$$

2) Logged Term Frequency

Logarithms are considered powerful modifiers of data,
as they can reduce the range of values in a set.
Logarithms are used to reduce the range of terms in any
given document. This dampens the data, decreasing the
distribution of frequencies as in Eq.(9) [16].

$$w_i = \sum_{j=1}^{N} \ln(tf_{ij} + 1) \qquad (9)$$

3) Document Term Frequency

The maximum term frequency in a document is a unique
metric, where the words that appear most within their
respective document are selected instead of summing
together all the term frequencies. This is normalized, so
as not to penalize words in short documents. This may
provide new terms to the vocabulary by finding terms
that appear often in one document, but not in any of the
others. It favors unevenly distributed word frequencies.
The calculation is done according to Eq.(10) [16].

$$w_i = \max_{1 \le j \le N} tf_{ij} \qquad (10)$$
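
The three term-frequency-based metrics can be sketched in a few lines of Python. Each function takes the per-document counts of a single term, i.e. the list of tf_ij values over j. The function names are hypothetical, and dividing by the document length in the third function is an assumption of this sketch, since the text says only that the metric is normalized.

```python
import math


def corpus_term_frequency(tf):
    """Eq. (8): total occurrences of the term over the whole corpus."""
    return sum(tf)


def logged_term_frequency(tf):
    """Eq. (9): sum of ln(tf_ij + 1), which dampens large counts."""
    return sum(math.log(c + 1) for c in tf)


def document_term_frequency(tf, doc_lengths):
    """Eq. (10): largest per-document frequency of the term.

    Assumption: each count is normalized by its document's term count
    before the maximum is taken.
    """
    return max(c / t for c, t in zip(tf, doc_lengths) if t)


# One term counted in three documents (invented numbers).
tf = [3, 0, 1]
doc_lengths = [10, 5, 2]
print(corpus_term_frequency(tf))
print(round(logged_term_frequency(tf), 4))
print(document_term_frequency(tf, doc_lengths))
```

With these numbers the third document, although short, gives the term its highest normalized frequency, which is the behaviour the Document Term Frequency metric is meant to reward.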



B. Normalization Based

Term normalization forms a standard metric for
information retrieval; it is carried out by dividing the
frequency of a term by the total number of terms in a
document. When each document is normalized, the effect of
size is removed, and each term frequency will form a
percentage of another characteristic of the document, such as
the document's term count [16].

1) Document Terms Counts

The most widespread normalization of a document is
carried out by dividing a term's frequency by the number of
terms in the document [16]. Assuming $T_j$ to be the total
term count in document j, $w_i$ is calculated as in Eq.(11).

$$w_i = \sum_{j=1}^{N} \frac{tf_{ij}}{T_j} \qquad (11)$$

2) Document Maximum Frequency

In this metric, the term frequency is divided by the
frequency of the most frequent term in a document, and the
results are then summed up. The most frequent word gets a
score of one for the document in which it is the most
frequent term, in addition to any score it obtains by
occurring in other documents. This has a similar effect to
normalization, because the score given to a term from any
single document will not be greater than one, but the scores
resulting from each document will differ from the scores
after standard normalization. The contribution to the weight
$w_i$ is the ratio of the term frequency to the frequency of
the most common term $P_j$, rather than to the document
size. Eq.(12) depicts this [16].

$$w_i = \sum_{j=1}^{N} \frac{tf_{ij}}{P_j} \qquad (12)$$

a) Document Maximum Frequency & Term Average
Frequency

This metric also employs normalization according to
the most frequent word in the document $P_j$, but here
the average frequency with which term i appears across
all documents in the corpus is subtracted from Eq.(12).
This is calculated as in Eq.(13) [16].

$$w_i = \left(\sum_{j=1}^{N} \frac{tf_{ij}}{P_j}\right) - \frac{1}{N}\sum_{j=1}^{N} tf_{ij} \qquad (13)$$

3) Corpus Maximum Frequency

The previous maximum frequency normalization
technique can be further explored by using the most
frequent term in the corpus. Being fixed, the frequency of
the corpus's most common term is a constant $P_c$. Results
should be similar to the results of term frequency, if not
exactly the same [16]. This metric is sometimes called corpus
relativized. Eq.(14) shows this calculation.

$$w_i = \sum_{j=1}^{N} \frac{tf_{ij}}{P_c} \qquad (14)$$

a) Corpus Maximum Frequency & Term Average
Frequency

Referring back to the previous metric, where the
normalization was based on the most frequent term in
the corpus, this metric is corpus relativized minus the
average of TF, as in Eq.(15) [16].

$$w_i = \left(\sum_{j=1}^{N} \frac{tf_{ij}}{P_c}\right) - \frac{1}{N}\sum_{j=1}^{N} tf_{ij} \qquad (15)$$
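
The normalization-based metrics can be sketched as follows for a single term. The function names are hypothetical. Taking the "term average frequency" to be the mean of the term's counts over all N documents is an assumption of this sketch, based on the textual description rather than on a confirmed formula.

```python
def document_terms_counts(tf, doc_lengths):
    """Eq. (11): sum of tf_ij / T_j, with T_j the term count of document j."""
    return sum(c / t for c, t in zip(tf, doc_lengths) if t)


def document_max_frequency(tf, doc_max):
    """Eq. (12): sum of tf_ij / P_j, with P_j the top term count in document j."""
    return sum(c / p for c, p in zip(tf, doc_max) if p)


def term_average_frequency(tf):
    """Assumed definition: mean of the term's counts over all N documents."""
    return sum(tf) / len(tf)


def document_max_minus_average(tf, doc_max):
    """Eq. (13): Eq. (12) minus the term's average frequency."""
    return document_max_frequency(tf, doc_max) - term_average_frequency(tf)


def corpus_max_frequency(tf, corpus_max):
    """Eq. (14): sum of tf_ij / P_c, with P_c fixed for the whole corpus."""
    return sum(c / corpus_max for c in tf)


def corpus_max_minus_average(tf, corpus_max):
    """Eq. (15): Eq. (14) minus the term's average frequency."""
    return corpus_max_frequency(tf, corpus_max) - term_average_frequency(tf)


# One term in three documents (invented numbers).
tf = [2, 0, 4]
doc_lengths = [10, 5, 8]   # T_j
doc_max = [4, 3, 4]        # P_j
corpus_max = 8             # P_c
print(document_terms_counts(tf, doc_lengths))
print(document_max_frequency(tf, doc_max))
print(document_max_minus_average(tf, doc_max))
print(corpus_max_frequency(tf, corpus_max))
print(corpus_max_minus_average(tf, corpus_max))
```

Note that the two "minus average" variants can produce negative weights, as they do here; how such weights were treated in the experiments is not stated in the text.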



C. Inverse Document Frequency

Inverse document frequency measures favor words that
appear in very few documents. They are frequently employed
in indexing, because indexed documents in the corpus are in
general varied, so a term that appears in few documents is a
good identifier for those documents. Inverse Document
Frequency together with term frequency assists in selecting
words that occur repeatedly, but only in few documents [16].

1) The TF-IDF

Here, the term frequency is multiplied by the number of
documents in the corpus, which is divided by the number
of documents ($n_i$) that contain the term [16]. Eq.(16)
shows the weight calculation.

$$w_i = \sum_{j=1}^{N} tf_{ij} \cdot \frac{N}{n_i} \qquad (16)$$

2) Logged IDF

This is similar to the TF-IDF measure, but here the term
frequency is weighted more highly. The logarithm
decreases the range of IDF values as in Eq.(17) [16].

$$w_i = \sum_{j=1}^{N} tf_{ij} \cdot \ln\left(\frac{N}{n_i}\right) \qquad (17)$$
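
A minimal sketch of the two inverse-document-frequency metrics of Eqs. (16) and (17) is given below for a single term. The function names are hypothetical, and returning zero for a term that occurs in no document is a choice made for this sketch.

```python
import math


def documents_containing(tf):
    """n_i: number of documents in which the term occurs at least once."""
    return sum(1 for c in tf if c > 0)


def tf_idf(tf):
    """Eq. (16): sum of tf_ij * N / n_i."""
    n_docs, n_i = len(tf), documents_containing(tf)
    if n_i == 0:
        return 0.0  # sketch choice: an absent term gets zero weight
    return sum(c * n_docs / n_i for c in tf)


def logged_idf(tf):
    """Eq. (17): sum of tf_ij * ln(N / n_i)."""
    n_docs, n_i = len(tf), documents_containing(tf)
    if n_i == 0:
        return 0.0
    return sum(c * math.log(n_docs / n_i) for c in tf)


# A term occurring in two of four documents (invented numbers).
tf = [2, 0, 4, 0]
print(tf_idf(tf))
print(round(logged_idf(tf), 4))
```

A term that occurs in every document has N / n_i = 1, so its Logged IDF weight is zero, whereas Eq. (16) still assigns it its raw corpus frequency.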



IX. DATASETS 

This work is validated using two NASA open source
datasets, MODIS and CM-1, which are used here to assess
the IR techniques employed. The MODIS dataset consists of
19 high-level and 49 low-level requirements, whereas the
CM-1 dataset contains 235 high-level requirements and 220
design elements. Both datasets were traced manually for
verification; the resulting traces are referred to as
"answer sets" or "theoretical true traces". There were
41 and 361 true links found for the MODIS and CM-1
datasets, respectively [6].
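
To give a feel for the size of the tracing problem, the short sketch below computes the M x N search space for both datasets from the figures quoted above, together with the share of pairs that are true links. The function name is hypothetical; only the dataset sizes come from the text.

```python
def search_space(n_high, n_low, true_links):
    """Return (number of candidate pairs, fraction of pairs that are true links)."""
    pairs = n_high * n_low
    return pairs, true_links / pairs


# (high-level items, low-level items, true links) as reported for each dataset.
datasets = {"MODIS": (19, 49, 41), "CM-1": (235, 220, 361)}
for name, (m, n, links) in datasets.items():
    pairs, share = search_space(m, n, links)
    print(f"{name}: {pairs} candidate pairs, {share:.2%} are true links")
```

MODIS yields 931 candidate pairs and CM-1 yields 51,700, so true links make up only a few percent of the pairs in MODIS and less than one percent in CM-1.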



X. Experimental Results 

Term extraction is presented in this paper, along with a
discussion of the commonly used preprocessing techniques.
First, the documents are parsed using the Statistical
approach, stop words (words such as 'the' and 'of') are
removed, and each remaining term is stemmed using
Porter's algorithm [17]. The term weighting is then
computed using the ten word frequency metrics rather than
TF-IDF. In this paper the vector space model is used for
Information Retrieval.
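
The preprocessing steps described above can be sketched as a small pipeline. The stop-word list is a tiny illustrative sample, and `crude_stem` is a placeholder that strips a few common suffixes; it is not Porter's algorithm, which a real implementation would substitute through the `stem` parameter.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "shall"}


def crude_stem(word):
    """Placeholder stemmer, NOT Porter's algorithm: strips a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=crude_stem):
    """Lowercase and tokenize the text, drop stop words, then stem each term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The system shall store the images of the sensor"))
```

The resulting term lists are the input from which the term frequencies tf_ij used by the ten metrics would be counted.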

The four filters were used together with the metrics 
described previously using MODIS and CM1 datasets. The 
results are compared with those found by Sundaram et al. 
[6]. 

A. First Dataset (MODIS) with Filters (0.2 and 0.25) 

In this section experiments are done using the MODIS
dataset and filters (0.2 and 0.25). Tables (I) and (II) show
the results of running the ten metrics for each filter. It was
found that:

• Filter 0.2: the recall value improved for all metrics; the
best value (68.2) was achieved by the Document Terms
Counts metric and is labeled with (*) in Table (I). The best
precision (23.7) was achieved by the Term Frequency -
Inverse Document Frequency metric.

• Filter 0.25: recall improved for nearly all metrics, the
exception being Document Term Frequency; the best value
(68.2) was achieved by the Document Terms Counts metric
and is labeled with (*) in Table (II). The best precision
(21.6) was achieved by the Term Frequency - Inverse
Document Frequency metric.
Table I. Result of metrics in MODIS dataset with filter (0.2)

Format       Term Weighting                                          Recall   Precision
XML [6]      TF IDF                                                  19.5     21.6
Statistical  Corpus Term Frequency                                   65.8     13.5
Statistical  Logged Term Frequency                                   65.8     14.2
Statistical  Document Term Frequency                                 24.3     7.6
Statistical  Document Terms Counts                                   68.2*    17.1
Statistical  Document Maximum Frequency                              63.4     16.5
Statistical  Document Maximum Frequency and Term Average Frequency   65.8     17.0
Statistical  Corpus Maximum Frequency                                65.8     13.7
Statistical  Corpus Maximum Frequency and Term Average Frequency     65.8     13.4
Statistical  Term Frequency - Inverse Document Frequency             34.1     23.7*
Statistical  Logged Inverse Document Frequency                       65.8     14.0

* best value



Table II. Result of metrics in MODIS dataset with filter (0.25)

Format       Term Weighting                                          Recall   Precision
XML [6]      TF IDF                                                  19.5     32.0
Statistical  Corpus Term Frequency                                   65.8     16.0
Statistical  Logged Term Frequency                                   63.4     16.4
Statistical  Document Term Frequency                                 17.0     7.6
Statistical  Document Terms Counts                                   68.2*    19.3
Statistical  Document Maximum Frequency                              63.4     19.5
Statistical  Document Maximum Frequency and Term Average Frequency   63.4     19.6
Statistical  Corpus Maximum Frequency                                65.8     16.1
Statistical  Corpus Maximum Frequency and Term Average Frequency     65.8     16.0
Statistical  Term Frequency - Inverse Document Frequency             19.5     21.6*
Statistical  Logged Inverse Document Frequency                       65.8     18.7

* best value



B. Second Dataset (CM1 ) with Filters (0 and 0.05) 

Here, experiments are done using the CM1 dataset and
filters (0 and 0.05). Tables (III) and (IV) show the results
of running the ten metrics for each filter. It was found that:

• Filter 0: the best recall (98.6) was achieved by the
Document Term Frequency and Term Frequency - Inverse
Document Frequency metrics. The best precision is (1.0)
for all metrics, as in Table (III).

• Filter 0.05: the best recall (95.2) was achieved by the
Term Frequency - Inverse Document Frequency metric. The
best precision (1.1) was achieved by the Document Term
Frequency, Term Frequency - Inverse Document Frequency
and Logged Inverse Document Frequency metrics, as in
Table (IV).



Table III. Result of metrics in CM1 dataset with filter (0)

Format       Term Weighting                                          Recall   Precision
XML [6]      TF IDF                                                  97.8     1.5
Statistical  Corpus Term Frequency                                   97.7     1.0
Statistical  Logged Term Frequency                                   98.0     1.0
Statistical  Document Term Frequency                                 98.6*    1.0
Statistical  Document Terms Counts                                   97.5     1.0
Statistical  Document Maximum Frequency                              97.5     1.0
Statistical  Document Maximum Frequency and Term Average Frequency   97.5     1.0
Statistical  Corpus Maximum Frequency                                97.7     1.0
Statistical  Corpus Maximum Frequency and Term Average Frequency     97.5     1.0
Statistical  Term Frequency - Inverse Document Frequency             98.6*    1.0
Statistical  Logged Inverse Document Frequency                       98.3     1.0

* best value



Table IV. Result of metrics in CM1 dataset with filter (0.05)

Format       Term Weighting                                          Recall   Precision
XML [6]      TF IDF                                                  92.2     4.3
Statistical  Corpus Term Frequency                                   86.9     1.0
Statistical  Logged Term Frequency                                   87.5     1.0
Statistical  Document Term Frequency                                 93.0     1.1
Statistical  Document Terms Counts                                   86.1     1.0
Statistical  Document Maximum Frequency                              86.9     1.0
Statistical  Document Maximum Frequency and Term Average Frequency   86.9     1.0
Statistical  Corpus Maximum Frequency                                86.9     1.0
Statistical  Corpus Maximum Frequency and Term Average Frequency     86.1     1.0
Statistical  Term Frequency - Inverse Document Frequency             95.2*    1.1
Statistical  Logged Inverse Document Frequency                       92.7     1.1

* best value



In the MODIS dataset, the recall measure for both filters
(0.2 and 0.25) showed better results for all metrics when
compared to [6], except for Document Term Frequency with
filter 0.25. Using the precision measure, only Term
Frequency - Inverse Document Frequency showed better
results with filter 0.2; with filter 0.25 all of the metrics
showed lower results than [6].

In the CM1 dataset, the best recall values were obtained
using filter 0 and the metrics Logged Term Frequency,
Document Term Frequency, Term Frequency - Inverse
Document Frequency and Logged Inverse Document
Frequency, which showed better results than [6]. With filter
0.05, Document Term Frequency, Term Frequency -
Inverse Document Frequency and Logged Inverse Document
Frequency were better than [6]. In precision, all of the
metrics showed lower results than [6].

In this work, the focus was on improving recall at the cost
of precision, because high-recall, low-precision lists of links
appear to be preferable to high-precision, low-recall
lists [4] [5]. This is because humans may be better
at deciding whether a specific link in the list is a match
than at finding new links in the documents [5].

XI. Conclusions and Future Work

In this paper, the effectiveness of information retrieval
methods in automating the tracing of textual requirements
was examined. Ten metrics were evaluated, and it was found
that better recall can be achieved when compared to TF-IDF.

In this work, the vector space model was adapted for
each of the metrics, in addition to the Statistical format.
The Porter Stemming Algorithm was applied, and two open
source datasets (MODIS and CM1) were used.

Future work can proceed in several directions, such as
the use of other Information Retrieval (IR) techniques
alongside the vector space model. Methods other than term
extraction can also be sought to enhance the results, and
other datasets can be used in this area.
Abstract- The fundamental step in implementing an
eye-based interface is the exact localization of the user's
pupil and iris and their separation from other parts of the
eye. This paper presents a fast adaptive iris localization
scheme that can be used with low resolution cameras or
webcams. The method finds the pupillary and iris boundaries
accurately, with low computational cost, under variable and
noisy lighting conditions. Based on the fact that the pixels
of the pupil are darker than other regions, the method
locates the pupil's boundary circle and then uses it to
localize the outer boundary of the iris. The first step in
identifying the inner boundary is to calculate a threshold
value that separates the pupil's pixels from other parts of
the eye. Next, we search in the neighbourhood of the
initial threshold value for the most appropriate threshold
value. This value is used to binarize the image, and the
Circular Hough Transform detects circles with certain values.
After most circles are removed, the most accurate one is
chosen; this circle is the boundary of the pupil. Finally, the
outer boundary of the iris is identified. The performance of
the proposed algorithm was assessed by using it to segment
both low and high resolution images. We also ran the
experiments on UBIRIS v1.0, for comparison with other
reported accuracies. The experimental results show that the
proposed method has a 14 ms detection time and 97.76%
detection accuracy, with an overall accuracy of 99.025%.
These results can also be achieved on low resolution images,
which shows an improvement in pupil and iris localization
performance in comparison with well-known methods.

Keywords: Eye localization; Eye tracking; Image 
analysis; Pupil and iris boundary; Robust Adaptive 
localization. 

I. Introduction 

Eye-based interaction is one of the new areas in
human-computer interaction, and there has been increased
interest in its applications. The fundamental step in
implementing an eye-based interface is the exact
localization of the user's pupil and iris. Iris localization (IL)
is commonly considered a challenging problem, and a
suitable scheme should cope with several conditions such as
uneven texture contrast, side effects due to the presence of
eyelids and eyelashes, variable contrast between sclera and
iris and between pupil and iris, bright spots and light
reflections on the iris, as well as blurred iris boundaries.

Currently, there is a large body of related research on iris
and pupil localization [1]. Each study uses a different
technique for extracting and localizing the iris and pupil.
Most of these methods use two circles or ovals to approximate
the boundaries of the iris: one for the inner boundary
(depending on the application type, it can be the pupil area
or the iris area) and the other for the outer boundary (the
iris area or the sclera region can be considered). The
estimation of the location and size of these two areas is the
focus of iris localization. This paper presents a fast and
robust IL algorithm based on the histogram and the Hough
transform. This method first detects the pupil boundary circle
and then localizes the outer iris boundary. To verify the
performance of the proposed method, we used a well-known eye
database (UBIRIS) and performed some basic experiments on it.
We show in these experiments that the proposed method has
acceptable accuracy and that its processing time is shorter
than that of previous methods. Then, we apply the proposed
algorithm to an eye tracker, along with a low resolution
camera, and we verify the effectiveness of this approach.

The organization of the rest of this paper is as follows: 
Section 2 introduces related works. Section 3 presents the 
proposed algorithm. Section 4 gives the experimental and 
comparative results of the proposed method and the 
conclusion comes in Section 5. 

II. Literature Review 

The circular pupil localization methods can be partitioned 
in three groups: template-based, cluster-based and Hough 
transform (HT) based techniques. 

The template-based method usually uses complex
parametric equations. In 1993, Daugman [2] presented an
iris recognition (IR) algorithm whose IL was based on an
integro-differential operator to locate the pupillary and iris
boundaries. The operation of his method was based on the
assumption that the pupil has circular edges. Daugman's method
is effective on high quality iris images, but its performance
decreases in the presence of specular reflection and in the
case of low contrast images. Camus and Wildes [3] proposed an
algorithm similar to Daugman's. This algorithm may fail under
noisy conditions or in the presence of light reflections.
Some recent works on template-based pupil localization
whose core is Daugman's method [4,5] or a combination
of some template-based techniques are as follows: the
integro-differential operator by Sanchez-Avila et al. [6,7],
the morphological and threshold technique by Kun Yu et al. [8]
and Kennell et al. [9], the bisection technique by Lim et al.
[10], pattern matching by Emmanvel Raj M. Chirchi et al.
[11], the high threshold method by Bhola Ram Meena [12],
the Laplacian of Gaussian (LoG) mask by R.
Krishnamoorthy et al. [13], threshold based Freeman's
chain code by Vatsa et al. [14], the active contour model by A.
Ross et al. [15] and A. Abhyankar et al. [16], the edge based
virtual circle by Boles et al. [17], wavelet based methods by
J. Cui et al. [18] and Y. Z. Shen et al. [19], the threshold
based ring mask by Lye Liam et al. [20], feature optimization
with PCA and an SVM classifier by K N Pushpalatha et al. [21], etc.
Template-based methods have high computational
complexity and their parameters require reconfiguration for
each database. These methods are sensitive to noise and
may fail when images do not have sufficient intensity
separation between eye regions.

The cluster based localization techniques typically proceed
in two stages: first, clustering and normalization, and
second, edge detection and localization of the iris boundaries.
Some studies that used clustering include texture
segmentation by Kavita [22], Hough clustering by Lili Pan
et al. [23], the GLCM pattern analysis technique by Bachoo et
al. [24], etc. The main disadvantage of this approach is that
cluster based localization requires more computation
time, because preprocessing steps like clustering and
normalization are time consuming.



[Figure 1 blocks: Get current video frame from webcam; Detect/track face region; Precise localization of eyes position; Read eye image from database. Arrow labels: Failed]

Fig. 1. The general block diagram of an iris and pupil
localization system

Figure 1 shows the general block diagram of an iris and
pupil localization system. The blocks with a grey background
prepare the eye image. The eye image can be read from an
eye database or captured by a camera. This stage of the
system is outside the scope of this research; our purpose
is to focus on iris localization, which starts in the next
stage. The blocks with a green background show our proposed
method. This method first locates the boundary of the pupil
and then uses it to localize the outer iris boundary. Figure 2
shows the details of our method. The red blocks perform pupil
localization and the yellow blocks localize the iris boundary.



The third category, HT based techniques, is the most common
methodology and has been used in several studies
[25,26,27,28]. HT based techniques first detect the edges in
a region of interest, and then the circular Hough transform
finds the circular boundaries of the iris. The accuracy of
the HT based approach depends on specific image
characteristics such as brightness and contrast. In addition,
the edge-detector algorithm and its necessary tuning
parameters are critical factors for localization accuracy.
Improper adjustment of these parameters can lead to
inappropriate boundary detection. As such, in this work, we
propose a fast and robust IL algorithm based on the Hough
transform that more accurately detects the inner and outer
iris boundaries.



III. Proposed Method



This section presents an adaptive IL algorithm that finds 
the pupillary and iris boundaries accurately with low 
computational cost under variable and noisy lighting 
conditions. This approach is based on the fact that the 
pupil's pixels (also the eyelashes and eyelids area pixels) 
are darker than the pixels of other regions. 



[Figure 2 blocks: The candidate eye image; Convert to YCbCr colour space and extract the luminance component (Y); Smoothing median filter; Compute the primitive threshold (T_i); Search for a proper threshold (T_c) and binarize image; Apply Hough Transform; Select best boundary for pupil; Specify the bounding ...; find the ...]

Fig. 2. The block diagram of the proposed method

The first block captures the candidate eye image and the
second block extracts its Y channel (in YCbCr colour space)
from the region of interest. The third block applies a median
filter and the next block calculates the initial threshold
value. This value separates the pupil pixels from the other
parts of the eye image. The next block searches for the most
appropriate threshold value to find the boundary of the
pupil, and the last red block uses HT to detect all circular
boundaries; finally, the algorithm chooses the best circle for
the pupil. The next stage, iris boundary localization, starts
with finding the edges in a bounding box around the pupil. The
third yellow block filters these edges and the next block
searches for circles. Finally, the best circle for the iris
boundary is chosen.

Pupil boundary localization 

The pupil is the darkest area of the eye region and its 
shape is a circle, so it can provide important features for eye 
localization. According to this feature, we can design an 
algorithm to find the connected dark pixels that form a 
circle in an eye image. 

This section presents a technique based on HT to localize 
the boundaries of both pupil and iris. HT suffers from high 
computational cost. This problem is the result of a large 
number of search states. The new algorithm reduces the 
execution time by removing both the irrelevant edge points 
and the number of possible circles. 
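
To make the cost argument concrete, the following is a minimal sketch of a circular Hough transform in plain Python. It is not the implementation used in this paper; the function names, the angular sampling of 72 steps and the synthetic test circle are assumptions chosen for illustration. Every edge point votes for all centres that would place it on a circle of each candidate radius, so the work grows with (edge points) x (radii) x (angles), which is why removing irrelevant edge points and restricting the radius range shortens the search.

```python
import math
from collections import Counter

def circular_hough(edge_points, radii, n_angles=72):
    """Accumulate votes for (cx, cy, r) cells; each edge point votes once per cell."""
    votes = Counter()
    for x, y in edge_points:
        cells = set()
        for r in radii:
            for k in range(n_angles):
                theta = 2.0 * math.pi * k / n_angles
                # A centre that puts (x, y) on a circle of radius r at angle theta
                cells.add((round(x - r * math.cos(theta)),
                           round(y - r * math.sin(theta)), r))
        votes.update(cells)
    return votes

def circle_points(cx, cy, r, n=72):
    """Rounded pixel positions on a circle (synthetic edge map for the demo)."""
    return sorted({(round(cx + r * math.cos(2.0 * math.pi * k / n)),
                    round(cy + r * math.sin(2.0 * math.pi * k / n)))
                   for k in range(n)})

edges = circle_points(20, 15, 8)
votes = circular_hough(edges, radii=range(6, 11))
best, best_votes = votes.most_common(1)[0]
print(best, best_votes, "accumulator cells:", len(votes))
```

Calling `circular_hough(edges, radii=[8])` on the same edge points fills far fewer accumulator cells, which illustrates the effect of limiting the number of possible circles.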

First, we need an eye intensity image to localize its 
boundaries. The luminance component (Y) of YCbCr 
colour space is chosen as the eye intensity image. The 
reasons for this selection are as follows: 

1. The luminance component of YCbCr is 
independent of colour, so it can be employed to 
solve the illumination and colour variation 
problem. 

2. In comparison with the components of the other 
colour spaces, the luminance component of YCbCr 
has a relatively higher contrast compared with skin 
and the white area of the eye. 
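
As an illustration of the luminance extraction, the sketch below converts an RGB image, stored as nested lists of (R, G, B) tuples, into a luminance image. The coefficients are the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114); treating these as the exact weights is an assumption, and the function name `luminance` is hypothetical.

```python
def luminance(rgb_image):
    """Return the Y (luma) image for rows of (R, G, B) tuples with values in 0..255."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

# A dark pupil-like pixel, a skin-like pixel and a bright sclera-like pixel
# (illustrative colour values, not taken from any database).
sample = [[(30, 25, 20), (200, 150, 130), (240, 240, 235)]]
print(luminance(sample))
```

The dark pixel maps to a much lower Y value than the skin-like and sclera-like pixels, which is the separation the thresholding step relies on.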

The next step uses the CDF method to find a primary 
threshold value that separates a high percentage of eye 
pupil pixels from the rest of the eye pixels [28]. Next, it 
scans nearby primary threshold values. For each threshold 
value, the CHT searches for the pupil boundary. In most 
cases, the first detected circle is the best choice but we filter 
out some of the detected circles. This novel method is as 
follows: 

1. Consider a horizontal band of the eye image, 
starting from one fourth to four-sixths of the height 
of the eye image (Fig 6.b and Fig 6.c). 

2. Extract the luminance component of this band (Fig 
6.d). Several definitions can be used for this 
transformation. This paper calculates the 
luminance Y as a weighted sum of RGB 
components. Equation (1) shows the calculation. 



Y=0.299R+0.587G+0.11B (1) 

3. Apply a smoothing median filter on it. (Fig 6.e). 

4. Compute the cumulative histogram 'H' of the
filtered image. The cumulative histogram can be
found by integrating the histogram of the
ROI by using the following equation:

$H(L) = \sum_{g=0}^{L} h(g)$ (2)

where h(g) is the histogram representing the
probability of occurrence of intensity level g and
$0 \le L \le 255$. Figure 3 shows the cumulative
histogram of an eye.




Fig. 3. (a.1) An eye image from our database
(a.2) The luminance component (Y) of the eye image
(a.3) Median filtered image
(a.4) Cumulative histogram of the filtered image



5. Compute the primitive threshold $T_i$, where $T_i$ is the
biggest intensity level whose value of H is smaller than P.
The parameter P is an assumed initial probability of an
eye pupil pixel. This means that if we calculate the
cumulative probabilities of all eye pixels, the pupil
pixels have a value around P. This parameter must be chosen
between 0.02 and 0.09. In this range, the
sensitivity of the overall system with respect to
this parameter is low [28]. Figure 4 shows an eye
image after thresholding and its corresponding P
value.

As we can see in Fig. 4, using the parameter P the
pupil pixels remain in the eye image and the other pixels
are removed.
Although this parameter is a great tool
for extracting pupil pixels, it cannot remove the
pixels of other areas completely, and the output
contains undesired parts of the eyebrow or other dark
areas of the eye. Therefore, we must perform some extra
steps to find the pupil boundary.
In steps 6 to 9, we use a loop and search around the
initial threshold $T_i$ (from $T_1$ to $T_2$) to find a
binarized image in which the pupil boundary is
better extracted.

[Figure 4 panels: main eye image; filtered eye image; thresholded images for P = 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09 and 0.1]

Fig. 4. Thresholded images of an eye image with respect to
different P values.

6. Consider the thresholds $T_1 = \alpha T_i$ and
$T_2 = \beta T_i$, where $0.5 \le \alpha < 1$ and
$1.1 \le \beta \le 1.5$. There exists a nonlinear relation
between these two parameters, $\alpha$ and $\beta$, and the
computational cost and accuracy of our
localization algorithm. The experimental results
show this nonlinearity. We start the algorithm by
choosing $\alpha = 0.9$ and $\beta = 1.1$.

7. We set $T_c$ and stw to $T_1$ and 1, respectively.
$T_c$ is the current threshold for binarizing the image and
can change from $T_1$ to $T_2$. The parameter stw is the
amount by which $T_c$ is incremented at each step.

8. The filtered image is binarized according to $T_c$
and the result undergoes CHT to find circles with
a radius smaller than 25% of the height of the
filtered eye image. The radius of the pupil circle is
less than 25% of this height. Therefore, in this step
the algorithm can localize the pupil boundary.

9. If the CHT method detects any circles then $T_a = T_c$
(see Fig. 6.f for the image binarized using $T_a$) and
we go to step 10; otherwise the following step
should be performed:

$T_c = T_c + stw$ (3)

If $T_c < T_2$ then we go to step 8 (search again);
otherwise there is no eye in the image or the eye is
closed and we finish the algorithm.

10. If the previous steps detect one circle, then that is
the boundary of the pupil. This is because the pupil
pixels are the darkest pixels in the eye
image, so by binarizing the image using low
thresholds the other pixels are removed and the remaining
area is a part of the pupil. If more
than one circle is detected, we filter out some circles and
choose one of them.

We can use the following solution for filtering and choosing
the best circle.

• The biggest object in the binarized image
includes part of the pupil. We choose the circle
that best covers this region, that is, the circle
that contains the most black pixels of the last
thresholded eye image. Applying some
geometrical constraints can also be useful.
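
The selection rule above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the binarized image is a list of rows in which 0 marks a black (below-threshold) pixel and 1 marks a removed pixel, that circles are (cx, cy, r) integer tuples, and the function names are hypothetical.

```python
def count_black_pixels(binary, circle):
    """Count black (0) pixels of the binarized image that fall inside the circle."""
    cx, cy, r = circle
    height, width = len(binary), len(binary[0])
    count = 0
    for y in range(max(0, cy - r), min(height, cy + r + 1)):
        for x in range(max(0, cx - r), min(width, cx + r + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r and binary[y][x] == 0:
                count += 1
    return count

def select_best_circle(binary, circles):
    """Return the candidate circle covering the most black pixels, or None."""
    return max(circles, key=lambda c: count_black_pixels(binary, c), default=None)

# Synthetic 20x20 binarized image with a black disc of radius 4 centred at (10, 10).
binary = [[0 if (x - 10) ** 2 + (y - 10) ** 2 <= 16 else 1 for x in range(20)]
          for y in range(20)]
candidates = [(3, 3, 3), (12, 10, 3), (10, 10, 4)]
print(select_best_circle(binary, candidates))
```

Counting black pixels alone favours large circles, so in practice a radius limit or another geometrical constraint would be applied before this step.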

Figure 5 shows the pseudo code of the pupil localization.
The last two functions, 'filterOutBadCircles' for removing
inaccurate circles and 'selectBestCircle' for choosing the
most accurate circle, perform step 10. Figure 6 shows the
results of all steps of the algorithm.

ALGORITHM localizePupil()
{
    // set initial values
    SET P to 0.05
    SET alpha to 0.9
    SET beta to 1.1
    SET stw to 1
    SET found to false

    // start localization
    mainEyeImage <- getEyeImage()
    Ycomponent <- extractYComponent(mainEyeImage)
    filteredImage <- smoothMedianFilter(Ycomponent)
    cHist <- computeCumulativeHistogram(filteredImage)
    Ti <- computePrimitiveThreshold(cHist, P)
    T1 <- alpha * Ti
    T2 <- beta * Ti

    FOR Tc = T1 to T2 with step stw
        img <- binarizeImage(filteredImage, Tc)
        edgeDetectedImage <- cannyEdgeDetect(img)
        circles <- applyCHT(edgeDetectedImage)
        SET n to number of detected circles
        IF n>0 THEN
            SET found to true
            IF n=1 THEN
                RETURN circles[0]
            ELSE
                filterOutBadCircles(img, circles)
                C <- selectBestCircle(img, circles)
                RETURN C
            END IF
        END IF
    END FOR

    IF found=false THEN
        PRINT "no eye or eye is closed"
        RETURN null
    END IF
}

FUNCTION computePrimitiveThreshold(cumHist, P)
{
    FOR L=1 to 255
        IF cumHist(L) > P THEN
            RETURN L-1
        END IF
    END FOR
}

FUNCTION filterOutBadCircles(binarizedEyeImg, circles)
{
    C <- the center of biggest object in binarizedEyeImg
    N <- count of circles
    maxD = 0.1 * binarizedEyeImg.Width
    FOR I = 0 to N-1
        IF DISTANCE(circles(I).center, C) > maxD THEN
            removeCircle(I)
        END IF
    END FOR
}

FUNCTION selectBestCircle(binarizedEyeImg, circles)
{
    N <- count of circles
    SET B to 0
    maxBlackPixels <- 0
    bestCircle <- null
    FOR I = 0 to N-1
        B <- the count of black pixels in circles(I)
        IF B > maxBlackPixels THEN
            maxBlackPixels <- B
            bestCircle <- circles(I)
        END IF
    END FOR
    RETURN bestCircle
}

Fig. 5. Pseudo code of our pupil localization algorithm
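
The threshold part of the pseudo code can be sketched in plain Python as shown below. This is a hedged illustration under several assumptions: the grey image is a list of rows of 8-bit values, pixels at or below the current threshold are marked 0 (black), the circle detector is passed in as a caller-supplied function (`detect_circles` is a hypothetical stand-in for edge detection plus CHT), and the fallback return value of 255 in `primitive_threshold` is added here and is not part of the pseudo code.

```python
def cumulative_histogram(gray):
    """Normalised cumulative histogram H(L), L = 0..255, of an 8-bit image."""
    hist = [0] * 256
    n_pixels = 0
    for row in gray:
        for value in row:
            hist[value] += 1
            n_pixels += 1
    cum, running = [], 0
    for count in hist:
        running += count
        cum.append(running / n_pixels)
    return cum

def primitive_threshold(cum_hist, p):
    """Largest level L-1 such that H(L) first exceeds p (as in the pseudo code)."""
    for level in range(1, 256):
        if cum_hist[level] > p:
            return level - 1
    return 255  # assumption: fall back to the top level if p is never exceeded

def sweep_thresholds(gray, detect_circles, p=0.05, alpha=0.9, beta=1.1, stw=1):
    """Try thresholds from alpha*Ti to beta*Ti and stop at the first detection."""
    t_i = primitive_threshold(cumulative_histogram(gray), p)
    t1, t2 = int(alpha * t_i), int(beta * t_i)
    for t_c in range(t1, t2 + 1, stw):
        binary = [[0 if v <= t_c else 1 for v in row] for row in gray]
        circles = detect_circles(binary)
        if circles:
            return t_c, circles
    return None  # no eye in the image, or the eye is closed

# Toy image: 4 dark "pupil" pixels (value 20) among 96 bright pixels (value 200).
gray = [[200] * 10 for _ in range(10)]
for (x, y) in [(4, 4), (5, 4), (4, 5), (5, 5)]:
    gray[y][x] = 20

def toy_detector(binary):
    """Stand-in detector: reports one circle once all 4 dark pixels are black."""
    black = sum(row.count(0) for row in binary)
    return [(4, 4, 1)] if black == 4 else []

print(sweep_thresholds(gray, toy_detector, p=0.03))
```

In this toy case the primitive threshold lies just below the dark pixels' intensity, and the sweep from T1 to T2 is what brings them into the binarized image.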




Fig.6. Step-by-step result of the proposed pupil 
localization method. 

a) The candidate eye image 

b) Considered horizontal band of the eye image 

c) Cropped band 

d) Extracted Y channel 

e) Applied smoothing median filter 

f) Thresholded image 

g) Edge detected 

h) Pupil localized eye image 

Iris boundary localization 

This step assumes that the previous steps have detected 
the inner circle of iris (pupil area) and it considers an area 
with the centre of the pupil circle to localize the iris. Within 
this area the edge points are detected and then the CHT 
searches for a circle with a certain radius size. The iris outer 
boundary detection is performed by using the following 
steps: 



Step 1: We define the bounding box with the centre of 
the pupil area and four times its size. 



Step 2: Find the edge points within this box by using the
Canny edge detector.

Step 3: Filter out some of these edge components (Fig.
7.c).

Step 4: Apply CHT and then search for a circle whose
center is within the pupil area and whose radius R satisfies
$2 r_{pupil} < R < 4 r_{pupil}$ (see Fig. 7.b), where
$r_{pupil}$ is the radius of the pupil circle. This step
returns the iris circle (Fig. 7.d). If no circle is detected
we can increase R and search again.
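
The geometric constraints of Steps 1 and 4 can be sketched as below. This is an illustrative sketch only: circles are (cx, cy, r) tuples, "four times its size" is read as a box whose half-side is four pupil radii (an assumption), and picking the candidate whose centre is closest to the pupil centre is an assumed tie-breaking rule that the text does not specify. The function names are hypothetical.

```python
import math

def iris_bounding_box(pupil, width, height):
    """Box centred on the pupil with half-side 4*r, clipped to the image."""
    cx, cy, r = pupil
    half = 4 * r
    return (max(0, cx - half), max(0, cy - half),
            min(width - 1, cx + half), min(height - 1, cy + half))

def pick_iris_circle(pupil, candidates):
    """Keep circles centred inside the pupil with 2r < R < 4r; return the most concentric."""
    cx, cy, r = pupil
    valid = [c for c in candidates
             if math.hypot(c[0] - cx, c[1] - cy) <= r and 2 * r < c[2] < 4 * r]
    if not valid:
        return None
    return min(valid, key=lambda c: math.hypot(c[0] - cx, c[1] - cy))

pupil = (50, 40, 10)
candidates = [(52, 41, 25), (80, 40, 30), (50, 40, 45), (50, 40, 15)]
print(iris_bounding_box(pupil, 200, 150))
print(pick_iris_circle(pupil, candidates))
```

Of the four candidates, only the first satisfies both the centre and the radius constraint; the others are off-centre, too large or too small.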






Fig.7. Step-by-step result of the proposed iris localization 
method. 

a) The candidate's eye image with pupil detected 

b) Considered bands of the iris 

c) Threshold image and some filtered edge components 

d) Iris localized eye image 



IV. Experimental results 

The algorithm was developed using the Visual C#.NET 2010
programming language and the OpenCV computer vision
library wrapper, EmguCV. It was tested on a Centrino
dual core 2.0 GHz CPU with Windows XP Service Pack 3
and 2 GB of DDR2 RAM. The UBIRIS.v1 [29] iris database
was selected for the experiments. This database is well known
and covers different eye image characteristics such as
environmental light changes, noise and focusing problems.
This database consists of two parts: Part 1 with 1214
images and Part 2 with 663 images.

Tables 1 and 2 show the variations of detection time and
accuracy of our method (for 500 random eye images from the
UBIRIS database) with respect to different values of
α and β, with stw = 1 and P = 0.05. Figures 8 and 9
show the results graphically.

We can see that tuning these values can decrease the
detection time by more than 30 ms and increase the
detection accuracy by about 2%. Therefore, we choose
reasonable values for these parameters. In our experiments,
these values are α = 0.9 and β = 1.1. These can be changed by
varying P.



Table I. The effects of the two parameters, α and β, on the
search time (ms) of the algorithm.

α \ β    1.1      1.2      1.3      1.4      1.5
0.5      39.30    42.58    44.56    45.28    47.70
0.6      27.97    30.80    32.09    33.76    35.54
0.7      21.10    22.88    25.28    26.86    27.70
0.8      16.66    18.37    19.54    21.11    21.73
0.9      14.93    17.67    19.92    21.64    22.78



Table II. The accuracy (%) of the algorithm based on the value
of the two parameters α and β.

α \ β    1.1     1.2     1.3     1.4     1.5
0.5      95.8    96.0    95.6    95.2    94.6
0.6      96.6    96.6    96.0    95.8    95.8
0.7      96.2    96.4    96.2    96.0    96.2
0.8      96.8    96.6    96.4    96.6    96.4
0.9      96.8    96.2    96.4    96.0    96.2



We tested our method on the UBIRIS database. Table 3 
shows the results of this experiment. In this experiment, we 
assume a localization error of less than 7 pixels as correctly 
localized. 



Table III. The results of our experiment on UBIRIS.v1

Localization Properties     Session 1    Session 2

α = 0.9, β = 1.1 and stw = 1
Correctly Localized         1188         571
Failed                      26           92
Min Localization Time       13 ms        14 ms
Max Localization Time       46 ms        61 ms
Mean Localization Time      14.21 ms     17.19 ms
Accuracy                    97.86%       86.12%

α = 0.9, β = 1.1 and stw = 2
Correctly Localized         1162         561
Failed                      52           102
Min Localization Time       5 ms         7 ms
Max Localization Time       18 ms        23 ms
Mean Localization Time      7.34 ms      8.01 ms
Accuracy                    95.71%       84.61%

As we can see in Table 3, with a small step width, the
overall accuracy of our method is 97.86% in session 1 and
86.12% in session 2; the proposed method
successfully localized 1188 images out of 1214 in session 1
and 571 images out of 663 in session 2. The worse result in
session 2 is due to the very high light variations and poor
intensity separability between sclera, iris and pupil in the
images of session 2. In this situation, the localization time
of our algorithm is less than 16 ms for α = 0.9 and β = 1.1.
This time is appropriate for online processing with a frame
rate of more than 60 frames per second. By increasing the
value of the stw parameter to 2, the localization time
is roughly halved and the accuracy of the method
decreases by about 2 percent.
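
The relation between per-frame localization time and the achievable frame rate is simple arithmetic, sketched below. The helper name is hypothetical, and the calculation assumes that localization is the only per-frame cost, which ignores capture and face or eye detection.

```python
def max_frame_rate(latency_ms):
    """Upper bound on frames per second if each frame takes latency_ms to process."""
    return 1000.0 / latency_ms

# 16 ms per frame leaves room for 62.5 frames per second.
print(max_frame_rate(16))
# Mean localization times reported for stw = 1 and stw = 2 in session 1.
print(round(max_frame_rate(14.21), 1), round(max_frame_rate(7.34), 1))
```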





[Figure 8 legend: β = 1.1, β = 1.2, β = 1.3, β = 1.4, β = 1.5]

Fig. 8. The detection time (ms) of our localization
algorithm with respect to variations of α and β.

[Figure 9 horizontal axis: α from 0.5 to 0.9 in steps of 0.05]

Fig. 9. The accuracy (%) of our localization algorithm
with respect to variations of α and β.

The next experiment describes a localization error to show 
the accuracy of the algorithm by Equation (4). 

detection error — — * 100 (4) 



In Equation (4), R and C are the actual radius and centre
of the pupil boundary, R' and C' are the radius and
centre of the area localized by the proposed algorithm, and
W is the width of the given eye in pixels. We define the
efficiency of the algorithm for a given detection error as the
number of eye images that our method localizes with that
detection error. Table 4 shows the efficiency of the proposed
algorithm on the two sessions of the UBIRIS database.
This experiment assumes that W is equal to the width of
the given eye image. The width of all images in the UBIRIS
database is 200 pixels.
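
As a rough illustration of how an efficiency table can be built, the sketch below expresses a localization error in pixels as a percentage of the eye width W and counts the images at each error level. How the radius and centre differences are combined into a single pixel error is not shown here, so `pixel_error` is treated as a given input; that, the function names and the sample errors are all assumptions.

```python
from collections import Counter

def detection_error_percent(pixel_error, width=200):
    """Express a localization error in pixels as a percentage of the eye width."""
    return 100.0 * pixel_error / width

def efficiency_counts(pixel_errors, width=200):
    """Count the images observed at each detection error level."""
    return Counter(detection_error_percent(e, width) for e in pixel_errors)

# With W = 200 pixels, one pixel of error corresponds to 0.5%.
errors_px = [0, 0, 1, 2, 1]  # illustrative values, not data from the experiments
print(sorted(efficiency_counts(errors_px).items()))
```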

Table IV. THE EFFICIENCY OF THE PROPOSED ALGORITHM

              Session 1                   Session 2
              (1188 true localized)       (571 true localized)
Detection
Error         count     percentage        count     percentage
0.0%          408       34.34%            327       57.27%
0.5%          359       30.22%            89        15.59%
1.0%          207       17.42%            66        11.56%
1.5%          146       12.29%            38        6.65%
2.0%          36        3.03%             19        3.33%
2.5%          19        1.6%              23        4.03%
3.0%          3         0.25%             6         1.05%
3.5%          10        0.84%             3         0.53%
4.0%          0         0%                0         0%



The results show that most images are localized with a low
detection error and that the overall efficiency is high.
Figures 10 and 11 show results of inaccurate iris
localization in session 1 and session 2, respectively.













Fig. 11. Inaccurate localization in session 2

Table V. THE ACCURACY RATE AND THE EXECUTION TIME OF THE PROPOSED METHOD

Method   Parameters                      Session 1   Session 2   Time for          Time for
                                                                 session 1 (ms)    session 2 (ms)
[3]      Hysteresis thresholds:          86.64%      73.26%      3600              3600
         Hi=50, Low=44; Gaussian
         kernel dimension=5
[30]     Constant parameters             98.13%      87.48%      650               650
[31]     Constant parameters             98.43%      85.82%      800               800
[32]     4x downsized ratio              95.46%      87.03%      2970              2970
ours     α=0.9, β=1.1, stw=1             97.87%      86.11%      14.21 [13-46]     17.19 [14-61]

Implementation notes:

[3]
• Cases of segmentation failure are determined visually.

[30]
• 200 iris images selected from Session 1 are used for
training.
• The remaining 1014 images are used for testing.
• The parameters learned are applied directly to
Session 2 without any adjustment.

[31]
• The accuracy of the localization is determined
visually.

[32]
• The assessment of the algorithm was done by
visual inspection.
• The images have been reduced in size by a factor of
4 to increase the speed of the algorithm.
• After some modifications, the accuracy obtained for
session 1 is 92.36% and for session 2 is 83.96%, with
an average segmentation time of only 0.48 seconds.

ours
• Fully reported in the experimental results section.


The next experiment compares the performance of the proposed
algorithm with Wildes [3] (a classic HT based method) and
three other methods introduced in [30], [31] and [32]. Table 5
shows the accuracy rate and the execution time of the
proposed method. The best rating in every item is marked
in green. The accuracy rate of our method is much
higher than that of Wildes' method, but it is lower than
[30] and [31] in session 1 and [30] and [32] in session 2.
However, in terms of speed, the proposed method is much
faster. This is because of the proposed steps that are
applied to reduce the search space of the CHT method. The
reported performance of Wildes' method was extracted from [30].



The next experiment tests this method in an online eye
tracking system by using a Microsoft LifeCam VX700
webcam for capturing frames. Each frame of the input
video is processed by three different AdaBoost classifiers to
detect the locations of the user's face and their left and right
eyes. We use a set of simple geometrical constraints to
verify the detected regions. Sometimes the user turns
his/her face away from the frontal position. In
these cases, the classifier fails to detect the face location
and we use the Camshift algorithm [23] to track the face,
which gives us the rotation angle of the head. Our method
supposes that the eyes are located in the second quarter of
the head and we use the head angle to approximate their
location. Table 6 shows the average execution time of each
part for a single frame. The pupil localization part is very
fast and its execution time is about 13 milliseconds.



Table VI. THE AVERAGE EXECUTION TIME OF EACH PART

Algorithm Parts                                             Time (ms)
Face detection (using Haar Cascade Classifier)              27.91
Face tracking (using Camshift)                              14.47
Left eye detection (using Haar Cascade Classifier)          30.20
Left eye tracking (using Haar Cascade Classifier
and some simple geometrical constraints)                    16.83
Pupil boundary detection                                    13.2
Total average execution (face & eye tracking +
eye pupil localization)                                     44.5



Another experiment was performed with a total of 10 
participants (8 males and 2 females) with a wide age range 
and some of them wearing glasses. The participants sat 
approximately 60 cm from the camera. A black screen with 
a moving red point was shown to them (Figure 12). The red 
point was randomly moved on the screen and we asked the 
users to keep their heads steady and gaze and track this 
point. Our application captured 9000 frames (5 minutes
at 30 frames per second) for every person and processed
these frames in real time. Table 7 shows the results.

The detection accuracy and the overall detection accuracy are
calculated using (5) and (6), respectively.

$\text{Detection Accuracy} = \frac{TP}{TP + FN} \times 100\%$ (5)

$\text{Overall Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%$ (6)

TABLE VII. THE DETECTION ACCURACY AND THE OVERALL DETECTION ACCURACY OF OUR METHOD

Column key: TP = true boundary localized; FN = wrong boundary for pupil was localized;
FP = wrong boundary is detected as pupil; TN = reported as "No Pupil".

           Left eye was opened            Left eye was closed
Subject    Total                          Total                        Detection    Overall
ID         Frames    TP       FN          Frames    FP      TN         Accuracy     Accuracy
1          8683      8546     137         317       220     97         98.42%       96.03%
2          8615      8526     89          385       184     201        98.97%       96.97%
3          8256      8166     90          744       180     564        98.91%       97.00%
4          8511      8434     77          489       103     386        99.10%       98.00%
5          8727      8689     38          273       54      219        99.56%       98.98%
6          8495      8317     178         505       131     374        97.90%       96.57%
7          8686      8573     113         314       116     198        98.70%       97.46%
8          8508      8472     36          492       41      451        99.58%       99.14%
9          8414      8398     16          586       57      529        99.81%       99.19%
10         8470      8412     58          530       97      433        99.32%       98.28%
Total      85365     84533    832         4635      1183    3452       99.025%      97.76%



TP: Number of frames with correctly localized eye pupil boundary (true positive) 

FN: Number of frames with open eye which the method did not localize (false negative) 

FP: Number of frames with closed eye that are a circle and are localized as pupil (false positive) 

TN: Number of frames with closed eye that are correctly reported as no pupil (true negative) 
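
The two accuracy measures can be checked against the totals row of Table VII with a few lines of Python. This is a small worked example rather than part of the original system, and the function names are hypothetical.

```python
def detection_accuracy(tp, fn):
    """Percentage of open-eye frames whose pupil boundary was correctly localized."""
    return 100.0 * tp / (tp + fn)

def overall_accuracy(tp, fp, fn, tn):
    """Percentage of all frames (open and closed eye) that were handled correctly."""
    return 100.0 * (tp + tn) / (tp + fp + fn + tn)

# Totals over the 10 participants, taken from Table VII.
tp, fn, fp, tn = 84533, 832, 1183, 3452
print(round(detection_accuracy(tp, fn), 3))        # 99.025
print(round(overall_accuracy(tp, fp, fn, tn), 2))  # 97.76
```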









Fig. 12. A participant sitting in front of the camera while a
black screen with a moving red point is shown to him

Because iris capture devices are mostly exposed to
natural illumination or other varying circumstances,
these conditions sometimes affect the quality of the iris
images and, in turn, the localization result. We
used this method in three lighting conditions, and Figure 13
presents the experimental results. It shows that varying
lighting conditions do not affect the overall performance
and accuracy of the algorithm.

The experimental results show that the proposed method
has a 99.025% detection accuracy with an overall accuracy
of 97.76%. These experimental results are promising and
show an improvement in pupil and iris localization accuracy
and execution time in comparison with similar well-known
methods.



Fig. 13. Step-by-step results of the proposed method used in the
eye tracking system for three light conditions. (a): Dark condition
(b): Medium lighting condition (c): Bright condition
(a.1, b.1, c.1) The user eye image

(a.2, b.2, c.2) Cropped horizontal band of the Y component

(a.3, b.3, c.3) Applied smoothing median filter

(a.4, b.4, c.4) Thresholded image

(a.5, b.5, c.5) Edge detected

(a.6, b.6, c.6) Pupil and iris localized eye image

The experiments show that some situations occasionally
decrease the accuracy of our algorithm. The first is the
result of a swift movement of the head. In this situation, our
algorithm is unable to correctly localize the eye. This is
because the eye images are blurry, so that the skin
colours blend with the colours of the eye areas. The
other situation is when the user bows his/her head or
changes the focus to a lower area (with respect to the
camera position), so that the eyelids are captured partially
closed. Other reasons for the failure of our method are high
reflections, high variations of luminosity and poor focus.

V. Conclusions 

This paper proposed a fast iris localization method based on 
CHT. The proposed method consisted of these steps: first, a 
horizontal part of the eye image was considered and its 
luminance version (Y component) was extracted. Second, a 
smoothing filter was applied on it and a primitive threshold 
was obtained. Third, in the neighbourhood of this threshold 
value, a certain threshold value for localizing the pupil 
boundary was considered and, finally, by the use of a 
localized pupil circle, the iris region was detected. The 
proposed approach improved both localization time and
accuracy in comparison with well-known methods.

Abstract - With the advancement of technology in the field of
wireless communication, different mobile communication devices
equipped with a variety of embedded sensors and powerful sensing
capabilities have emerged. Participatory sensing is the process that
enables individuals to collect, analyze and share local knowledge
with their own mobile devices. Although the use of participatory
sensing offers numerous benefits in deployment cost,
availability, spatial-temporal coverage, energy consumption and
so forth, it poses certain threats which may compromise the
participators' location and trajectory data. Hence,
ensuring the participators' privacy is the most urgent task. The
existing proposals put more emphasis on participators' location
privacy, and very few of them consider the privacy of the
trajectories. The theoretical mix zone model has been improved
by considering the time factor from the viewpoint of graph
theory, and a mix zone graph model has been presented. This model
considers only sensitive trajectories for providing privacy, thereby
reducing overall information loss and storage space. Further,
instead of defining a single mix zone graph model, multiple mix
zones are created in order to enhance the privacy of the
participators' trajectories.

Keywords- Location privacy, mix zone graph model, multiple 
mix zone, participatory sensing, trajectories. 



I. Introduction

The growth of mobile phones, along with their pervasive
connectivity, has led to the development of a new sensing
model called participatory sensing [1]. Here, the mobile
devices carried by individuals act as sensors, thereby
eliminating the need to deploy sensors at particular areas.
Participatory sensing allows participators to sense, collect,
analyze and share information about their surrounding
environment using their mobile phones. For example, mobile
phones may continuously report the current temperature or
sound level; likewise, vehicles may report traffic conditions.

A vast amount of trajectory data is collected, and it grows
progressively as the participators sense more data. A
trajectory is the path followed by a moving object and is
generally represented by samples (x, y, t), where x and y are
the location coordinates and t denotes the timestamp. In
typical participatory sensing applications, the data reports
generated as output may reveal the participators'
spatial-temporal information. An adversary can obtain valuable
results from the published trajectories, and the collected data
may be used to deduce private information about the user.
Ensuring the participators' privacy is therefore the most
urgent task. The gathered information is crucial to
participatory sensing systems, as a shortage of it endangers
the success of such systems. Therefore, the privacy of
participatory sensing users must be preserved by protecting
their trajectories.

The Mix Zone Graph Model [2] is one of the existing approaches
for providing privacy to the trajectories of the participators.
A mix zone is a region where no application can track user
movements. It is the region where users can change their
pseudonyms without being observed by adversaries. A pseudonym
is a uniquely generated random number. Each participator enters
a mix zone with one pseudonym and exits it with another. The
change of pseudonym breaks the continuity of a user's location
exposure, thereby protecting the future locations of the user.
However, existing mix zone solutions mainly focus on the
development of a single mix zone. To provide more security,
multiple locations are selected for applying the mix zone graph
model. Thus, the multiple mix zone model [3] is used to provide
maximal privacy to the trajectories of the participatory
sensing users.
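
The pseudonym change described above can be illustrated with a minimal sketch. It assumes a circular mix zone and a trace of (x, y, t) samples; the function name, the circular shape and the pseudonym format are illustrative assumptions, not part of the cited models.

```python
import math
import secrets


def publish_with_mix_zone(trace, center, radius):
    """Attach pseudonyms to (x, y, t) samples, switching after each zone visit.

    Samples that fall inside the zone are withheld, so the pseudonym
    change itself is never observed. (Illustrative sketch only.)
    """
    pseudonym = secrets.token_hex(8)  # random identifier, hypothetical format
    inside_before = False
    published = []
    for x, y, t in trace:
        inside = math.hypot(x - center[0], y - center[1]) <= radius
        if inside_before and not inside:
            # The participator has just left the zone: draw a new pseudonym.
            fresh = secrets.token_hex(8)
            while fresh == pseudonym:
                fresh = secrets.token_hex(8)
            pseudonym = fresh
        if not inside:
            published.append((pseudonym, x, y, t))
        inside_before = inside
    return published
```

Samples published before and after the zone then carry unrelated identifiers, which is what breaks the continuity of the location exposure.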

A. Location Privacy 

Location privacy is defined as the ability to prevent
unauthorized parties from learning one's current or past
location. Traditionally, the privacy of personal location
information has not been a critical issue, but with the
advancement of location tracking systems capable of following
user movement twenty-four hours a day, seven days a week,
location privacy becomes crucial: records of everything from
the particular rack a person visits in the library to the
clinics a person visits in a hospital can form a very invasive
list of data. Numerous systems can determine the location of a
person. One of the original systems designed for position
tracking is the Global Positioning System (GPS), which uses
satellites to help devices determine their own position.
Generally, automated digital devices obtain information
through communication, observation, or inference.

B. Trajectory Privacy 

A trajectory is the path that a moving object follows through
space as a function of time. Examples include the monitored
movements of wild animals, birds, people, or a soccer player.
Trajectories may be uni-dimensional or multi-dimensional.
Participatory sensing systems primarily depend on the
collection of information across large geographic areas. The
sensor data uploaded by participators are usually tagged with
the spatial-temporal information of when and where the readings
were recorded, and the published trajectories are used for
decision making. For example, merchants may decide where to
build a food store that would produce maximum gain by analyzing
the trajectories of consumers in a selected area, and the
Department of Transportation can devise an optimized vehicle
scheduling strategy by monitoring the trajectories of motor
vehicles. However, publication poses considerable threats to
the participators' privacy. An adversary may examine the
trajectories, which contain abundant spatial-temporal
background information, in order to link the numerous reports
that are collected. Hence, it is crucial to unlink the
participators' identities from sensitive data collection
locations.

C. Existing Technique Limitation 

TrPF, the Trajectory Privacy Preserving Framework for
Participatory Sensing Applications, is an existing approach
that preserves the trajectories of the participators by
applying the Mix Zone Graph Model at a single sensitive
location. The problem is that if an adversary succeeds in
guessing the pseudonym used at this single mix zone, the whole
trajectory can be inferred.

E. Our Solution

In this paper, we propose an approach for preserving the
trajectories of the participators by applying the Mix Zone
Graph Model at multiple locations, thereby enhancing the
privacy level of the participators. Further, due to cost
constraints, not all points of interest can be considered as
candidates for applying the Mix Zone Graph Model. The problem
of selecting and placing the mix zones is therefore also
addressed here.

F. Our Contribution 

Our contributions in this paper are as follows:

• To secure location and trajectory privacy of the 
participatory sensing user by applying Mix Zone 
Graph Model. 

• To secure multiple sensitive locations of the 
participatory sensing user. 

• To prove that privacy of the user can be enhanced by 
protecting multiple sensitive locations instead of 
single sensitive location. 

The remainder of this paper is organized as follows. Section II
discusses the related work. Section III describes the existing
system and Section IV the proposed system. Section V presents
the evaluation and results. Finally, Section VI concludes the
paper and outlines future work.

II. RELATED WORK 

Several approaches exist in the literature to protect the
position of the user. Some of them are discussed below.

A. Location Privacy Protection

Several works analyze location privacy preserving schemes.
They can be classified into the following categories.



D. Our Observation 

Instead of applying the Mix Zone Graph Model at a single
sensitive location, multiple locations can be considered as
candidates. As the number of locations increases, the number of
pseudonyms to be cracked by an adversary increases, and thus
the probability of a successful attack is reduced. An attack is
said to be successful if the adversary is able to crack all the
pseudonyms used in the corresponding mix zones. Consider a
scenario where the Mix Zone Graph Model is applied at three
locations. The adversary can deduce the whole trajectory only
by cracking the pseudonyms at all three locations. Hence, as
the number of mix zones increases, the number of pseudonyms to
be identified increases, which raises the privacy level.
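
The effect of the number of mix zones can be made concrete with a short worked equation. It rests on a simplifying assumption that is not stated in the text: the adversary cracks the pseudonym change at each mix zone independently and with the same probability p. The symbols p and n are introduced here for illustration only.

```latex
% Assumption: each of the n mix zones is cracked independently
% with the same probability p, where 0 < p < 1.
P_{\text{success}}(n) = p^{\,n}
% Example with an assumed p = 0.5:
%   n = 1:  P_success = 0.5
%   n = 3:  P_success = 0.5^3 = 0.125
```

Under this assumption the success probability shrinks geometrically with every additional mix zone, which matches the qualitative argument above.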



1) Obfuscation: It is defined as the means of intentionally 
degrading the quality of information about an individual's 
location in order to protect that individual's location [4]. 

2) Mix Networks: Mix Networks [4] use anonymizing channels to
de-link reports submitted by sensors before they reach the
applications. In other words, Mix Networks act as proxies that
forward user reports only when some system-defined criteria are
met. For example, a Mix Network may wait to receive k reports
before forwarding them to the application in order to guarantee
k-anonymity. However, the anonymity level directly depends on
the number of reports received and "mixed" by the Mix Network.
Mix Networks rely on statistical methods to protect privacy and
do not guarantee provably secure privacy. In addition, a
moderately long time may pass before the desired level of
anonymity is reached (when "enough" reports have been
gathered). Accordingly, Mix Networks may markedly reduce system
throughput and cannot be used in settings where regular reports
are needed.

3) K-Anonymity: k-anonymity is a widespread general privacy
concept not limited to location privacy. It guarantees that in
a set of k objects (in this case, mobile users) the target
object is indistinguishable from the other k - 1 objects.
Consequently, the probability of identifying the target user is
1/k. The idea behind k-anonymity is that, rather than his exact
position protected only by a pseudonym, a user reports to the
service provider an obfuscation region containing his position
and the positions of k - 1 other users. As an example, consider
that Alice is currently at home and queries a location-based
service for the nearest cardiology facility. Without
anonymization, this query could reveal to the provider of the
service that Alice has health issues. With k-anonymity, Alice
is indistinguishable from at least k - 1 other users, so the
provider cannot link the request to Alice. It is therefore
necessary that all k users of the calculated anonymization set
report the same obfuscation region [4], so that the provider
cannot connect the reported position to Alice's home location.
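
A minimal sketch of how such an obfuscation region might be computed is given below. It assumes planar coordinates and uses an axis-aligned bounding box around the user and the k - 1 nearest other users; the function name and the bounding-box choice are illustrative assumptions, not a method taken from [4].

```python
def cloaking_region(user, others, k):
    """Bounding box (min_x, min_y, max_x, max_y) covering `user`
    and its k - 1 nearest neighbours among `others`."""
    if len(others) < k - 1:
        raise ValueError("not enough other users to reach k-anonymity")
    nearest = sorted(
        others,
        key=lambda p: (p[0] - user[0]) ** 2 + (p[1] - user[1]) ** 2,
    )[: k - 1]
    xs = [user[0]] + [p[0] for p in nearest]
    ys = [user[1]] + [p[1] for p in nearest]
    return (min(xs), min(ys), max(xs), max(ys))
```

Note that this sketch only builds the region for one user. To satisfy the requirement stated above, every member of the anonymization set would have to report this same box rather than a box computed from its own neighbours.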

Some of the existing trajectory privacy protection schemes are
as follows:

1) Dummy Trajectory Obfuscation: From a data publication
viewpoint, trajectory privacy can be protected with a simple
dummy trajectory obfuscation approach, which generates dummy
trajectories to confuse the adversaries. So that fake
trajectories cannot be told apart from the true ones, dummy
trajectories are usually generated under two rules: first, the
movement patterns of a dummy trajectory need to be similar to
those of real users; second, the trajectories need to intersect
as often as possible. Following these rules, dummy trajectories
are usually generated by rotating true users' trajectories. The
main drawback is the difficulty of generating similar-looking
trajectories, on which the quality of anonymity depends.

2) Suppression-Based Method: This method is based on the
assumption that different adversaries may hold diverse and
disjoint parts of users' trajectories. It decreases the
probability of exposing whole trajectories: a trajectory piece
is suppressed if its publication would raise the breach
probability of the whole trajectory above a particular
threshold. This technique works well in preventing the exposure
of whole trajectories to adversaries. The main setback is that
some useful data may get lost during suppression.



4) Mix-Zones: A pseudonym is used to break the linkage between
a user's identity and his/her events. The change of pseudonym
is normally performed in pre-determined areas known as mix
zones. A difficulty with this method is that there must be
enough users in the mix zone to offer an acceptable level of
anonymity.



3) Trajectory K-Anonymity: The trajectory k-anonymization [6]
technique proposes a scheme where every trajectory is
generated such that it is indistinguishable from k - 1 other
trajectories. In this approach, trajectories are first
clustered based on a log cost metric; then each sample location
on the trajectories is generalized to a region containing at
least k moving objects. Finally, trajectories are reconstructed
by arbitrarily choosing sample points from the anonymized
regions.



5) Dummy Locations: This approach employs the idea of dummy
locations [5] to protect the user's location privacy. A
location-dependent query is abstracted as Q = (pos, P), where
the parameter pos is the mobile user's location and the
parameter P denotes the user-specified predicates. We call such
a query Q the original query. With the dummy location strategy,
the original query is converted into a query
Q' = (pos1, pos2, ..., posk, P), where the posi include the
user's real location and k - 1 dummy locations, and P is the
original query predicate that applies to all k locations. We
call the query Q' a location privacy query, since it hides the
user's location.
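
The conversion of an original query into a location privacy query can be sketched as follows. The dummy generation rule (uniform offsets within a small `spread`) and all names are illustrative assumptions; [5] may generate dummies differently.

```python
import random


def location_privacy_query(pos, predicate, k, spread=0.01, rng=None):
    """Turn the original query (pos, predicate) into a query holding
    k positions: the real one plus k - 1 dummies, in random order."""
    rng = rng or random.Random()
    dummies = [
        (pos[0] + rng.uniform(-spread, spread),
         pos[1] + rng.uniform(-spread, spread))
        for _ in range(k - 1)
    ]
    positions = dummies + [pos]
    rng.shuffle(positions)  # hide which slot holds the real position
    return (*positions, predicate)
```

The predicate is kept as the last element, so the service can evaluate it against all k positions without learning which one is real.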

B. Trajectory Privacy Protection 



4) Trajectory Privacy Preserving Framework: This technique
proposes the use of the Mix Zone Graph Model, where a mix zone
is applied over a single sensitive location. The pseudonyms of
the participators are changed in this mix zone in order to
protect the users' trajectories from being inferred by an
adversary. A Directed Weight Graph of the mix zone model is
created, in which an adversary cannot map an exact relationship
between a participator's arrival time and exit time. This
technique works well but considers only a single sensitive
location where the mix zone graph model is applied.

C. Comparison of Existing Techniques with Proposed System

Several works exist in which the locations of users as well as
their trajectories are given privacy. Dummy location [7] is a
mechanism that creates fake alias locations for the user's
location in order to confuse the adversary. Location
k-anonymity is defined in [8] as a privacy approach designed to
protect an individual against identification within a specific
dataset. Another technique used for location privacy is
obfuscation [9], where the user's location is purposefully
altered to lower the precision of the user's spatial-temporal
information. This can be achieved using generalization or
perturbation. A pseudonym [10] is a randomly generated unique
identifier provided to each user before entering a sensitive
area called a mix zone [11]. A mix zone is an area where a
participator's movement cannot be tracked by anyone. The
pseudonym is generated to break any link between the user's
identity and his/her events. Mix networks [12] anonymize the
channels through which users submit reports to the system. It
has been observed that once a user's trajectory has been
identified, it becomes easy to derive the locations of the
user.

Trajectory privacy schemes also exist in the literature. One is
dummy trajectories [13], where fake location trajectories of
the users are created. This technique provides privacy to the
trajectories; however, the main problem is how to generate fake
trajectories that look exactly like real ones. Another is the
suppression-based technique [14], where trajectories are
suppressed on the assumption that the adversary cannot infer
the user's information if the whole trajectories are not
exposed. The main threat to this approach is that essential
data may get lost during suppression. The trajectory
k-anonymization [15] technique proposes a scheme where every
trajectory is generated such that it is indistinguishable from
k - 1 other trajectories. All these techniques deal with whole
trajectories and thus increase the storage cost. Not all
locations are sensitive, so privacy can be provided around the
sensitive locations only, instead of over whole trajectories
[16]. To overcome the defects above, a new scheme is proposed
to preserve the privacy of the trajectories at multiple
sensitive locations.

III. EXISTING SYSTEM 

Most of the existing techniques focus on the location privacy
of the participators, while few approaches consider preserving
the trajectories of the user. An approach called the Trajectory
Privacy Preserving Framework (TrPF) for participatory sensing
applications has been proposed. The participators, known as
data collectors, sense spatial-temporal information through
their mobile devices. This information is stored by the Report
Server, which generates data reports that are eventually stored
on the Application Server. Any authorized end user or
participator can view these reports. Trusted third party
servers are used to maintain security for end users and data
collectors. Fig. 1 shows the overall architecture of the TrPF
system.

Fig. 1. Architecture of TrPF System

In this approach, the trajectories of the participators are
preserved using the Mix Zone Graph Model. Only sensitive
trajectories, not all trajectories, are considered when
applying the mix zone graph model. First, a sensitive location
o is taken as the centre and a sensitive area is constructed
around it. The parts of trajectories that intersect the
sensitive area are called sensitive trajectory segments. The
mix zone graph model is then applied to these segments. Thus,
the trajectories of the participators are preserved.
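
The extraction of sensitive trajectory segments can be sketched as below, assuming a circular sensitive area of radius r around o and trajectories given as lists of (x, y, t) samples. The sketch works on the sampled points only, not on the exact geometric intersection, and the function name is hypothetical.

```python
import math


def sensitive_segments(trajectory, o, r):
    """Split a trajectory into maximal runs of consecutive samples
    that lie inside the sensitive area centred at `o` with radius `r`."""
    segments, current = [], []
    for x, y, t in trajectory:
        if math.hypot(x - o[0], y - o[1]) <= r:
            current.append((x, y, t))
        elif current:
            # The trajectory has left the area: close the current segment.
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```

A trajectory that enters the sensitive area more than once yields one segment per visit.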

A. Limitations 

• Only a single sensitive location is considered.

• Requires more time to process a query, as only raw
trajectories are considered.

IV. PROPOSED SYSTEM 

A. Architecture 

The existing solution considers only a single sensitive
location while constructing the mix zone graph model. This
leads to the lack of a systematic approach for global privacy
protection. To overcome this drawback, the proposed system
defines multiple sensitive locations around which mix zone
graph models are applied. Not all points of interest can be
considered as candidates for applying the mix zone graph model.
The main reason is the cost constraint, which limits the number
of mix zones that one can deploy. The problem to address is
therefore the placement of the multiple mix zone graph models,
which is an optimization problem.

The proposed system, shown in Fig. 2, works as follows. First,
the data collectors sense their spatial-temporal information
and provide it to the Server using their mobile phones.
Consider a participator who provides their current location
(x, y) using the GPS embedded in the mobile phone. As the
participator moves, his/her locations get stored on the server,
eventually forming location traces, i.e. trajectories. These
trajectories must be protected from an adversary in order to
preserve the privacy of the participators. The Mix Zone Graph
Model technique is used to provide privacy to the participator.
In the proposed system, we consider multiple locations as
candidates for applying the Mix Zone Graph Model. Due to cost
constraints, not all locations can be considered as points of
interest where the model can be applied. Hence, the selection
and placement of the locations at which to apply the Mix Zone
Graph Model is the problem to be addressed; its solution is
the Multiple Mix Zone Placement Model. After the locations are
received as the output of the Multiple Mix Zone Placement
Model, the Mix Zone Graph Model is applied at all of them.
Meanwhile, an end user may query the data stored on the server,
and the server returns the appropriate result.

Fig. 2. Proposed System Architecture

For instance, consider an Online Car Booking System where end
users, i.e. the customers, can book a car online at any time.
The administrator assigns a driver to the customer depending on
the availability of drivers. The customer can track the
assigned driver at any time. Considering the privacy of the
driver, not all location trajectories of the driver should be
visible to the customer. The drivers, who are the participators
in this system, provide their locations to the administrator
using their mobile phones. The trajectories of the drivers are
stored on the server, where they can be viewed by the
administrator and by the driver concerned. No other party
should be able to view the whole trajectory of a driver. Hence,
the Mix Zone Graph Model is applied at multiple sensitive
locations of the driver, which prevents the customer from
viewing the whole trajectory. The sensitive locations of the
driver, such as his house, hospital, gym or work place, must
not become known to the customer or an adversary. Applying the
Mix Zone Graph Model at multiple locations prevents an
adversary or an end user from inferring the whole trajectory of
the participator. Thus, the trajectory privacy of the
participatory sensing user is preserved.

B. Solution - Multiple Mix Zone Placement Model 

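This approach determines the positions where the Mix Zone Graph
Model has to be applied. It first finds the points (vertices)
whose removal makes the graph disconnected. Such points are
called articulation points. Removing them partitions the graph
into disconnected components, which eliminates the pairwise
connections between the components. To refine the quality of
the solution further, a set of independent vertices, i.e.
vertices that are not adjacent to each other, is found.
Finally, the number of mix zones is limited by the given cost
constraint.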


Consider a graph G = (V, E), where the vertices V represent the
Points of Interest (POIs) of a participator and the edges E
represent the road segments connecting the POIs. The first step
builds on the observation that partitioning G into several
disconnected components eliminates the pairwise connections
across these components. Therefore, we look for vertices whose
removal disconnects the graph. Such vertices are referred to as
articulation points in graph theory. Take the area graph in
Fig. 3 as an example. Any route from 1 to 9 or from 1 to 12
needs to go through vertices 6 and 10. Therefore, 6 and 10 are
articulation points in this graph. If a mix zone is deployed at
vertex 6 or 10, a pseudonym that appears at any vertex in the
bottom part of the graph cannot appear at vertices 9, 12 and
11. Hence, the total number of pairwise associations is
reduced.
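
Articulation points can be found with a standard depth-first search that tracks discovery times and low-link values. The sketch below is one common textbook formulation, assuming a simple undirected graph stored as an adjacency dictionary; it is not taken from the paper, and the example graph in the usage note is hypothetical rather than a copy of Fig. 3.

```python
def articulation_points(graph):
    """Return the vertices whose removal disconnects an undirected graph.

    `graph` maps each vertex to an iterable of its neighbours
    (simple graph: no parallel edges or self-loops assumed).
    """
    disc, low, result = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])  # back edge to an ancestor
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # No back edge from v's subtree climbs above u.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
        if parent is None and children > 1:  # root of the DFS tree
            result.add(u)

    for start in graph:
        if start not in disc:
            dfs(start, None)
    return result
```

For a hypothetical graph with a triangle 1-2-3, a chain 3-6-10 and a triangle 10-9-12, the function returns the three vertices 3, 6 and 10, since every route between the two triangles passes through them.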







Fig. 3 Point of Interests Graph 

After G is partitioned into disconnected components, the mix
zone deployment in each component is further refined to improve
the solution quality. In graph theory, an independent set is a
set of vertices no two of which are adjacent. Hence, if all
vertices that are not in an independent set are selected as mix
zones, there is no pairwise association between the vertices in
the independent set. Again, take the bottom part of Fig. 3 as
an example. The circled vertices {1, 8, 3, 5} form a maximal
independent set for the lower part of the graph. If vertices
{2, 4, 6, 7} are selected as mix zones, the pseudonym ux that a
user Alice carries at vertex 1 will not appear at any other
vertex in the independent set. As a result, Alice's past and
future locations on her trajectory are protected even if her
identity is exposed at vertex 1. Finally, the number of mix
zones must be controlled to meet the cost and service
constraints. In the last step of our algorithm, we iteratively
remove from the mix zone candidate set selected by the previous
steps the vertex whose removal introduces the smallest
increment in pairwise associations, until the cost constraint
is met.
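
The refinement and pruning steps can be sketched as follows. Two assumptions are made that the text does not fix: the maximal independent set is built greedily (lowest degree first), and the number of pairwise associations is counted as the number of vertex pairs still connected by a path that avoids every mix zone. The articulation-point step is omitted here, and all function names are hypothetical.

```python
def maximal_independent_set(graph):
    """Greedy maximal independent set, visiting low-degree vertices first."""
    chosen = set()
    for v in sorted(graph, key=lambda u: len(graph[u])):
        if all(n not in chosen for n in graph[v]):
            chosen.add(v)
    return chosen


def pairwise_associations(graph, mix_zones):
    """Count vertex pairs linked by a path that avoids every mix zone."""
    seen, total = set(mix_zones), 0
    for start in graph:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        total += size * (size - 1) // 2  # pairs inside this component
    return total


def place_mix_zones(graph, max_zones):
    """Start from all vertices outside the independent set, then drop the
    candidate whose removal adds the fewest associations until the
    cost constraint (at most `max_zones` zones) is met."""
    zones = set(graph) - maximal_independent_set(graph)
    while len(zones) > max_zones:
        drop = min(zones, key=lambda z: pairwise_associations(graph, zones - {z}))
        zones.remove(drop)
    return zones
```

On a path graph 1-2-3-4-5, the greedy independent set is {1, 3, 5}, so vertices 2 and 4 become mix zone candidates and no pairwise association remains; limiting the budget to one zone leaves three associated pairs.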

C. Mathematical Model 

Let S = {I, P, O}

I = Input
P = Process
O = Output

I = {SI, GQ}

SI = Sense Information
GQ = Generate Query

P = {TR, MMPM, MZGM}

TR = Generate trajectories.
MMPM = Determine multiple locations.
MZGM = Apply Mix Zone Graph Model.

O = {PR}

PR = Provide results of the generated queries.

Fig. 4 represents the mathematical model of overall proposed 
system. 




Fig. 4. Mathematical Model

D. Algorithms Used

1) GraphConstruct Algorithm

This algorithm is used to construct the mix zone graph model,
which is represented as a Directed Weight Graph (DWG)
G = (V, E), where

V represents the set of vertices, which are constructed from
the pseudonyms.

E represents the set of edges, which represent the
participators' trajectory mapping from ingress to egress in the
sensitive area.



Algorithm 1 GraphConstruct

Input: Trajectory Tr and pseudonym set P.
Output: Directed Weight Graph (G).
1: Procedure
2: Define a sensitive location and construct a sensitive area
around it such that Si = {o, r}, where o is the sensitive
location and r is the radius.
3: Determine the set of sensitive trajectory segments Tf.
4: Randomly select an ingress pseudonym Pi and assign it to the
vertex Vi.
5: Randomly select an egress pseudonym Pj such that Pj ≠ Pi and
assign it to the vertex Vj.
6: Construct the edge Eij such that Eij -> (Vi, Vj).
7: Assign a weight Wij to each edge using the WeightConstruct
algorithm.
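
A minimal sketch of steps 4 to 7 is given below. It assumes that the sensitive trajectory segments have already been determined and that the edge weight is supplied by a callable standing in for WeightConstruct; the dictionary representation of the graph and all names are illustrative assumptions.

```python
import random


def graph_construct(segments, pseudonyms, weight_of, rng=None):
    """Build a directed weighted graph as {(ingress, egress): weight}.

    segments   -- sensitive trajectory segments, one per participator
    pseudonyms -- pool of unused pseudonyms (two are needed per segment)
    weight_of  -- callable(segment) -> weight, a stand-in for WeightConstruct
    """
    rng = rng or random.Random()
    pool = list(pseudonyms)
    if len(pool) < 2 * len(segments):
        raise ValueError("need two distinct pseudonyms per segment")
    rng.shuffle(pool)  # random selection of ingress and egress pseudonyms
    edges = {}
    for segment in segments:
        p_in, p_out = pool.pop(), pool.pop()  # distinct, so p_out != p_in
        edges[(p_in, p_out)] = weight_of(segment)
    return edges
```

Drawing both pseudonyms from one shuffled pool guarantees that the egress pseudonym differs from the ingress pseudonym and that no pseudonym is reused across participators.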



2) Algorithm 2: WeightConstruct

This algorithm is used to find the weights of the edges in the
graph of the mix zone graph model. Here,
Vi represents a participator entering the mix zone.
K represents the total number of participators entering the mix
zone.
Pi represents the ingress pseudonym of a participator.
Pj represents the egress pseudonym of a participator.
tingress(Vi) represents the time at which the participator
enters the mix zone.
[tj, tj+1] represents the time interval during which the
participator exits the mix zone.
P(Vi, t) represents the probability that a single participator
exits the mix zone in the time interval [tj, tj+1].
The participator Vi takes between tj - tingress(Vi) and
tj+1 - tingress(Vi) time in the mix zone for data collection.
Δt represents the data collection time in the mix zone.
f(Δt) is the probability density function (PDF) of the data
collection time in mix zones.

Therefore,

P(V_i, t) = \int_{t_j - t_{ingress}(V_i)}^{t_{j+1} - t_{ingress}(V_i)} f(\Delta t) \, d\Delta t    (1)

The above equation represents the probability of a single
participator exiting from the mix zone. Thus, the probability
for all the participators exiting from the mix zone is given by
(2), where P(V, t) represents the probability that all
participators exit from the mix zone in the time interval
[tj, tj+1].



Output:- A set of at most NP selected mix zone positions 
1 : Procedure 

2: Find articulation points in the given graph. 
3: Find maximal independent set. 
4: Maintain Cost Constraint. 



P(V, t) = \sum_{i=1}^{K} P(V_i, t)    (2)

However, only one of them is the real participator. Hence, the
probability that the participator Vi exits in the time interval
[tj, tj+1], denoted by P(Vi [tj, tj+1]), is given by the
following conditional probability:

P(V_i [t_j, t_{j+1}]) = \frac{P(V_i, t)}{P(V, t)}    (3)

The weight Wij is given by Wij = P(Vi [tj, tj+1]), such that
Wij lies between 0 and 1 for i ∈ [1, K] and

\sum_{i=1}^{K} W_{ij} = 1    (4)



V. EVALUATION AND RESULTS

Here, real-time data is taken as the input to the system. As
explained earlier, the participators provide their location
(x, y) and timestamp (t) to the Server using the GPS of their
mobile phones. The trajectories are stored on the server in the
form (tid, x, y, t), where tid is the trajectory ID. Sensitive
locations are chosen, around which the Mix Zone Graph Model is
applied. Meanwhile, the end user can access the relevant data
on the server. The Online Vehicle Booking System is built as a
website using C# ASP.NET, whereas the participator provides its
spatial-temporal data to the Server using Android mobile
devices. The participator-side module is developed for Android
in Java.



The WeightConstruct algorithm is given as follows:



Algorithm 2 WeightConstruct

Input: tingress, Δtegress = [tj, tj+1] and Δt.
Output: Edge weight W.
1: Procedure
2: Determine the probability P(Vi, t) of a single participator
exiting the mix zone in the given time interval.
3: Determine the probability P(V, t) of all participators
exiting the mix zone in the given time interval.
4: Find the probability of a single participator exiting the
mix zone, denoted by P(Vi [tj, tj+1]).
5: Assign P(Vi [tj, tj+1]) as the weight of the edge.
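
The steps of WeightConstruct can be sketched as below. The paper does not state which distribution the data collection time follows, so an exponential distribution with an assumed `rate` parameter is used purely for illustration; the single-participator probability is the integral of the PDF over the exit interval, and each weight is that probability normalised by the sum over all participators. Function names are hypothetical.

```python
import math


def exit_probability(t_in, t_j, t_j1, rate):
    """Chance that a participator who entered at t_in exits in [t_j, t_j1],
    assuming exponentially distributed data collection time."""
    def cdf(d):
        return 1.0 - math.exp(-rate * d) if d > 0 else 0.0
    # Integral of the PDF from (t_j - t_in) to (t_j1 - t_in).
    return cdf(t_j1 - t_in) - cdf(t_j - t_in)


def edge_weights(ingress_times, t_j, t_j1, rate=1.0):
    """Normalise the single-participator probabilities into edge weights."""
    singles = [exit_probability(t, t_j, t_j1, rate) for t in ingress_times]
    total = sum(singles)  # probability over all participators
    if total == 0:
        raise ValueError("no participator can exit in this interval")
    return [p / total for p in singles]
```

By construction every weight lies between 0 and 1 and the weights for one exit interval sum to 1.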



3) Algorithm 3: Multiple Mix Zone Placement Model

This algorithm determines the positions where the mix zone
graph model has to be applied. It first finds the points
(vertices) whose removal makes the graph disconnected. Such
points are called articulation points. Removing them partitions
the graph into disconnected components, which eliminates the
pairwise connections between the components. To refine the
quality of the solution further, a set of independent vertices,
i.e. vertices that are not adjacent to each other, is found.
Finally, the number of mix zones is limited by the given cost
constraint.



Algorithm 3 Multiple Mix Zone Placement Model. 



Input :- A graph G and Z. 



This work aims to show that the privacy level of a participator
can be enhanced by applying the Mix Zone Graph Model at
multiple sensitive locations instead of a single sensitive
location. This is shown by measuring the rate of successful
attacks on a single mix zone as compared to multiple mix zones.
An attack is successful if the adversary finds out, from the
side information, the corresponding pseudonym used by a user.
The success rate of an adversary is the ratio of the number of
successful attacks to the total number of attacks. Fig. 5 shows
the attack success rate for different numbers of mix zones,
where the X axis represents the number of mix zones deployed at
various sensitive locations and the Y axis represents the rate
of successful attacks.

Fig. 5. Comparison of Privacy Level

The above graph shows that the rate of successful attacks is
high when the number of mix zones is small, and that as the
number of mix zones increases the rate of successful attacks
decreases, thereby improving the level of privacy. The reason
is that the adversary has to crack a correspondingly larger
number of pseudonyms in order to deduce the whole trajectory.
With a single mix zone the task is substantially simpler for
the adversary, as only one pseudonym has to be cracked. So as
the number of locations where the mix zone graph model is
applied increases, the privacy preservation of trajectories
increases. Thus, the proposed scheme offers better privacy than
the existing systems.
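The argument that every pseudonym along the trajectory must be cracked can be illustrated with a simple model. The sketch below assumes that each pseudonym is cracked independently with the same probability `p_single`; both the independence assumption and the function name are hypothetical simplifications, not the evaluation method behind Fig. 5.

```python
def whole_trajectory_success(p_single, num_mix_zones):
    """Chance of deducing the whole trajectory when one pseudonym per
    mix zone must be cracked, assuming independent cracks (assumption)."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if num_mix_zones < 1:
        raise ValueError("at least one mix zone is required")
    return p_single ** num_mix_zones


# Illustrative values only: the chance shrinks as mix zones are added.
for zones in (1, 2, 3, 4):
    print(zones, whole_trajectory_success(0.5, zones))
```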

Another advantage of the proposed work is that it requires less
storage space than the existing techniques. Previous work such
as dummy trajectories and trajectory k-anonymity stores all
trajectories in order to provide protection. Given t
trajectories, each containing N segments, the storage space
required to store all t trajectories is O(N·t). The trajectory
mix zone graph model approach, in contrast, requires only
pseudonyms to be stored, and only sensitive trajectory segments
are considered rather than all trajectories. Hence the storage
space required for this approach is considerably less than that
of the previous work. Further, an increase in the number of
trajectories may not greatly affect the number of pseudonyms.
By comparison, our proposal therefore needs less storage memory
than the other proposals.



VI. CONCLUSION AND FUTURE WORK 

Participatory sensing leverages the ubiquity of mobile phones
to open new perspectives in terms of sensing. The analysis has
revealed that virtually all applications capture location and
time information. The collected data is stored in the form of
trajectories, whose privacy needs to be preserved. The
Trajectory Mix Zone Graph Model is used here to provide privacy
for the trajectories of the participators. The approach applies
the Mix Zone Graph Model at multiple sensitive locations rather
than at a single sensitive location. The results show that
applying the mix zone graph model at multiple sensitive
locations instead of a single sensitive location increases the
privacy level of the participator. Hence the proposed system
provides better results than the existing techniques in terms
of increased privacy level and reduced storage space. In
future, the mix zone graph model can be applied to multiple
sensitive locations of semantic trajectories.
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ABSTRACT 

Big Data is typically large and complex, and it has
become an emerging topic in the network security field.
How to handle the large volume of security data produced
by heterogeneous network security devices, and how to
analyze and correlate such Big Data network events,
remain open questions [1-3]. This paper develops a
Big Data Analytics Security framework that uses
collection, analysis and integration, event correlation and
scenario analysis techniques to process the raw data
gathered from a Big Data infrastructure (BDI).

Keywords: Big Data Security, Big Data analytics, 
Intelligence Analysis, Intrusion detection 

I. INTRODUCTION 

As Big Data systems play increasingly vital roles in
modern society, the security of Big Data systems
becomes more and more important. Security
techniques such as firewalls, anti-virus, VPN, IDS and
security audit have been developed to protect against
security threats. However, many more issues are exposed
in the practice of security implementation and
deployment. There is a lack of an effective method to
analyze the security events generated by the various
network equipment in BDIs.

The use of multiple and diverse sources producing
huge amounts of data calls for research into new
solutions for monitoring and analysis that can recognize
ongoing malicious activities in BDIs in a timely and
efficient manner. In the age of Big Data, network security
monitoring systems need to meet several requirements.
First, abnormal alarms must be found as soon as possible,
so that early measures can be taken to avoid or mitigate
the impact on services. This is in effect an application of
trend forecasting. Second, alert information must be
diagnosed correctly: the real, non-redundant information
must be extracted from the mass of alerts in order to find
the root of the problem and to solve it.



Sensitivity and reliability are a pair of conflicting
requirements for network security monitoring in a Big Data
environment. At the same time, the network level at which
a detection system operates and its performance limit the
information it can acquire and analyze. Some of the alert
information obtained will therefore inevitably be missing,
and Big Data techniques for predicting missing information
can be applied to get a true picture of network attacks or
anomalies.

Big Data can fully exploit its externality: fusing and
cross-referencing related data sets generates value much
larger than the sum of the parts. For network
security monitoring in a Big Data environment, the most
important step is the correlation of events. There is
much research on traditional event correlation, such as
rule-based correlation, Bayesian network inference,
model-based reasoning, filtering, case-based
reasoning and artificial neural network reasoning.

In this context we propose a framework comprising
visual analytics components which are coordinated by
modules working in support of security analysis. This
paper is structured as follows: In Section II, we present
the needs and challenges of big data analytics security. In
Section III, we explain the design of our proposed
framework for Big Data Analytics for Security (BDAS). We
conclude the paper and present future work in Section IV.

II. NEEDS AND CHALLENGES 

At present, network security monitoring is receiving more
and more attention and research, for example
intrusion detection based on data analysis and integrated
management of network security [4, 5]. However, most
current work tries to develop a network security
monitoring system based on a partial view of the whole
network. We still find the following outstanding issues in
the practice of threat detection in these systems:

1. During the early construction of BDIs, the
various security equipment lacks the ability to
communicate management and security
information with other equipment. This shortage reduces
the overall system efficiency and increases the cost in
discovery time and response time.

2. The detection and reporting of security events
present only raw data, without further
"information" and "knowledge", as a result of
insufficient analysis ability.

3. There is a lot of research on event correlation,
such as rule-based correlation, Bayesian
network reasoning, model-based reasoning, filtering,
case-based reasoning and artificial neural network
reasoning. In most cases, however, these techniques are
applied only to a single anomaly detection system and
do not address the BDI as a whole, including the network
equipment and sub-domain hosts. Furthermore, the
lack of a rapid and effective communication
mechanism severs the relationship among monitoring
components. Based on the above analysis, this paper
develops a framework for big data analytics security
that can effectively detect anomalies in BDIs.

III. PROPOSED FRAMEWORK FOR
DATA ANALYTICS SECURITY

A. The goal of the proposed framework

More recently, the focus of security has shifted to
monitoring network and Internet traffic for the detection
of bad actions, as opposed to the traditional approach of
detecting bad signatures. Specifically, traditional
security focuses on catching malware by scanning
incoming traffic against malware signatures, which
detects only limited-scope threats that have already been
encountered in the past. In addition, the development of
signatures lags far behind the development of attack
techniques. Thus, techniques like intrusion detection
systems, firewalls and anti-virus software can easily be
rendered ineffective by attackers. This situation has
become more critical with the presence of big data within
computer networks: the petabytes and exabytes of
information transferred daily between nodes make
it very easy for attackers to enter a network, hide their
presence effectively and cause severe damage.
These big data problems are highlighted in the following
points:

• Corporations are now extending their data networks to
allow partners and customers to access data in
different ways to facilitate collaboration, hence
making networks more vulnerable to attacks. The
advent and extensive use of cloud and mobile
computing have also generated new attack methods.

• The advent of big data has seen a corresponding
increase in the hacking skills of attackers, who now evade
traditional security measures such as signature-based
tools, which are becoming a thing of the past.

• With big data, it is possible to collect only a relatively
small slice of security information, e.g. network logs,
Security Information and Event Management (SIEM)
alerts, access records, etc. Hence, damage done by new
intrusion methods may be realized only after an
attack.

• Big data also prevents most security data from being
analyzed due to its complexity, e.g. data could be
coming from different sources, could be stored in
different formats on different machines, or could be
generated too quickly to make any type of analysis
feasible with traditional techniques, computer
hardware and software architectures.

Security Analytics addresses these issues by rethinking
Big Data security. It employs techniques
from Big Data Analytics (BDA) to derive useful
information for preventing attacks [6]. It provides the
following unique features:

• A more agile decision-making approach for network
managers, with surveillance and monitoring of real-time
network streams,

• Dynamic detection of both known and previously
unknown suspicious or malicious behavior, usage and
access patterns, transactions and network traffic flows,
applicable to all types of intrusion threats,

• Effective detection of suspicious and malicious
behavior (lowest possible false positive rate),

• Ability to deal with suspicious and malicious behavior
in real time,

• Appropriate dashboard-based visualization techniques
to provide full visibility (360° view) of network
progress and problems in real time,

• Appropriate big data hardware and software to cope
with the aforementioned requirements.

B. Framework Model 

We have developed a framework for Big Data Analytics 
Security based on above considerations. 






http://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 



(IJCSIS) International Journal of Computer Science and Information Security, 
Vol. 13, No. 5, May 2015 



1. DATA PRE-PROCESSING 

This module collects the raw data from the general
data sources in the Big Data system, including the flow
rate, event logs and security logs. Most of the collected
data are unordered and come in mixed formats and
descriptions, which makes detecting system behavior
much more difficult. We use the Data Standardizing and
Integrating Module to solve this problem. First, this
module standardizes the raw data into a unified format
using a pre-defined pattern. Next, it filters the
standardized data and removes redundancy from it. Then,
it classifies the data by following the classification rules.
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Figure 1 : The Framework for Big Data Analytics 
Security 
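The three pre-processing steps (standardize, remove redundancy, classify) can be sketched as follows. This is a minimal illustration only: the unified record format `(source, event_type, message)`, the field names and the grouping-by-type rule are assumptions made for the example, not the pattern or classification rules of the framework.

```python
from collections import defaultdict


def standardize(raw):
    """Map a raw record (dict with arbitrary key casing) to the assumed
    unified format (source, event_type, message)."""
    rec = {k.lower(): str(v).strip() for k, v in raw.items()}
    return (rec.get("source", "unknown"),
            rec.get("type", "unknown").lower(),
            rec.get("msg", ""))


def remove_redundancy(records):
    """Drop exact duplicates while preserving first-seen order."""
    seen = set()
    unique = []
    for r in records:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    return unique


def classify(records):
    """Group records by event type (a stand-in for classification rules)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[1]].append(r)
    return dict(groups)


def preprocess(raw_records):
    return classify(remove_redundancy([standardize(r) for r in raw_records]))


# Hypothetical raw records from two devices, in inconsistent formats.
raw = [
    {"Source": "fw1", "Type": "DENY", "Msg": "tcp/23 blocked"},
    {"source": "fw1", "type": "deny", "msg": "tcp/23 blocked "},
    {"source": "ids1", "type": "alert", "msg": "port scan"},
]
result = preprocess(raw)
print(result)
```

The first two records differ only in casing and whitespace, so they collapse into one entry after standardization.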

2. DATA ANALYSIS 

This component analyzes the data and provides as
outputs information on (i) how to adapt the monitoring
granularity, and (ii) which protective actions should be
carried out on the architecture. Starting from our past
experience with attack modeling and data analysis, we
consider the following functional blocks.

Data Processing: collected raw data usually contain
unnecessary or redundant information,
which may affect the effectiveness of the analysis [7].
The first step of the analysis is therefore to retrieve the
raw data and apply filtering or event coalescence
techniques, such as those analyzed in [8].
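As an illustration of event coalescence, the sketch below merges events of the same kind that occur close together in time into a single group. The time-window rule, the `(timestamp, key)` event format and the function name `coalesce` are assumptions made for this example; they are not the specific techniques analyzed in [8].

```python
def coalesce(events, window):
    """Merge events sharing a key when each one occurs within `window`
    seconds of the previous event of that key.

    events: iterable of (timestamp, key).
    Returns a list of (first_ts, last_ts, key, count) tuples.
    """
    merged = []        # each entry: [first_ts, last_ts, key, count]
    open_group = {}    # key -> index of its most recent group in `merged`
    for ts, key in sorted(events):
        idx = open_group.get(key)
        if idx is not None and ts - merged[idx][1] <= window:
            merged[idx][1] = ts       # extend the open group
            merged[idx][3] += 1
        else:
            open_group[key] = len(merged)
            merged.append([ts, ts, key, 1])
    return [tuple(m) for m in merged]


# Hypothetical log: three close login failures, one unrelated event,
# and a late login failure that starts a new group.
events = [(0, "login_fail"), (2, "login_fail"), (3, "disk_full"),
          (4, "login_fail"), (60, "login_fail")]
groups = coalesce(events, window=5)
print(groups)
```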

Attack Modeling: this functional block provides tools to
define and statically analyze attack patterns. The model
underlying this block must be able to: (i) provide a
high degree of flexibility in representing many
different security scenarios compactly; (ii) enable the
specification of different types of constraints on the
possible timing of attacks; (iii) represent attack scenarios
at different levels of abstraction, so that the conformance
checking task can be focused in various ways. Typed
temporal graph-based attack models [9] seem to be good
options for the above requirements. They are rich in terms
of the time constraints that can be expressed. In addition,
it is relatively easy to define and manipulate
generalization/specialization hierarchies between
different types of events.

Conformance Checking: the main purpose of this
functional block is to detect attack instances in
sequences of logged events by checking the recorded
behavior against the given set of attack patterns. The main
requirement of this block is scalability. In real-world
scenarios, such as the protection of critical infrastructure,
detection must run online, and we would like to
raise an alert as soon as an event with a "criticality"
above the threshold is recorded. It is therefore important
to define appropriate data structures that ensure rapid
access to the relevant information, together with
algorithms closely tied to such structures that ensure
rapid detection of an attack [10], [11]. Furthermore, it is
important to identify the conditions that make the
problem tractable from a theoretical point of view.

Indeed, recent work on detecting instances of temporal
automaton models in logged sequences of events [12],
[13], [14] has shown that detection times suitable for
real-world use can be obtained by limiting the number
of partial solutions through a form of temporal filtering
based on constraints. Finally, parallelizing both the data
structures and the algorithms for conformance checking
(see [15]) seems mandatory when targeting big data
for security protection.
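The idea of limiting partial solutions by temporal filtering can be sketched with a very small matcher. This is an illustration under simplifying assumptions, not the automaton models of [12]-[14]: the attack pattern is a plain ordered list of event types, the only constraint is a maximum time span `max_span` for the whole pattern, and each partial match advances greedily. All names are hypothetical.

```python
def detect(events, pattern, max_span):
    """Return (start_ts, end_ts) for each completed pattern instance.

    events: list of (timestamp, event_type), sorted by timestamp.
    pattern: ordered list of event types that make up the attack.
    max_span: maximum allowed time between first and last event.
    """
    partial = []   # partial matches as (start_ts, index of next expected type)
    hits = []
    for ts, etype in events:
        # Temporal filtering: drop partial matches that can no longer
        # complete within max_span.
        partial = [(s, i) for s, i in partial if ts - s <= max_span]
        advanced = []
        for s, i in partial:
            if pattern[i] == etype:
                if i + 1 == len(pattern):
                    hits.append((s, ts))         # pattern completed
                else:
                    advanced.append((s, i + 1))
            else:
                advanced.append((s, i))
        if etype == pattern[0]:
            if len(pattern) == 1:
                hits.append((ts, ts))
            else:
                advanced.append((ts, 1))         # start a new partial match
        partial = advanced
    return hits


# Hypothetical three-step attack pattern and event log.
pattern = ["scan", "login_fail", "priv_esc"]
events = [(0, "scan"), (5, "login_fail"), (50, "priv_esc"),
          (100, "scan"), (110, "login_fail"), (120, "priv_esc")]
hits = detect(events, pattern, max_span=30)
print(hits)
```

The first candidate is discarded at time 50 because it exceeds the 30-second span, so only the second instance is reported.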

Invariant-based Mining: invariants are properties of a
system that are expected to hold for all its executions.
If those properties are violated (or broken) while
monitoring the execution of the system, it is possible to
trigger alarms that are useful for undertaking immediate
protective actions. Several studies have confirmed that it
is possible to discover invariants of complex real-world
systems [16], [17]. In our case, however, the challenge is
to find invariant relationships in the large amount of data
collected from the architecture. The invariant-based
mining block is intended to deal with this problem by
performing two tasks: (i) automatic extraction of
invariants from flow data using autoregressive models,
and (ii) detection at runtime of broken invariant
relationships, to trigger immediate action. A preliminary
application of the approach to real data collected from a
production cloud software system has demonstrated its
feasibility and usefulness for discovering performance
deviations and SLA violations [18].
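The two tasks can be illustrated with the simplest possible autoregressive model. The sketch below fits a first-order model x[t] ≈ a·x[t-1] + b to a training series by least squares and flags time steps where the prediction error exceeds k times the training residual spread. The AR(1) form, the threshold rule, the function names and the sample values are all assumptions made for illustration; the approach in [18] may use richer models relating several flows.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a * x[t-1] + b; returns (a, b, sigma),
    where sigma is the root-mean-square training residual."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least three samples")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("training series is constant")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    resid = [y - (a * x + b) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in resid) / n) ** 0.5
    return a, b, sigma


def violations(series, model, k=3.0, floor=1e-9):
    """Indices t where the learned relationship is broken at runtime."""
    a, b, sigma = model
    tol = max(k * sigma, floor)
    return [t for t in range(1, len(series))
            if abs(series[t] - (a * series[t - 1] + b)) > tol]


# Hypothetical metric samples: a normal training run, then a monitored
# run containing one spike.
normal = [100, 102, 101, 103, 102, 104, 103, 105, 104, 106]
model = fit_ar1(normal)
broken = violations([105, 107, 106, 180, 107], model)
print(broken)
```

Both the spike itself and the step after it are flagged, since the return to normal also violates the learned relationship.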

Fuzzy Logic: statistical methods cause a lot of false
alarms. This is due to the difficulty of defining precise
and crisp rules describing when an event is an anomaly or
not. The boundaries between the normal and the
abnormal behavior of a system are not clear, and the
degree of intrusion at which an alarm should be raised
may vary in different situations [19]. Fuzzy logic is
derived from fuzzy set theory to handle approximate
rather than precise reasoning, and it helps to smooth
the abrupt separation of normality and abnormality [20].
The degree of truth of an expression need not be crisp,
and fuzzy linguistic variables can express imprecision in
measurement. In certain cases, an attack detection
accuracy of 99.95% was achieved [19].

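A minimal sketch of how fuzzy membership replaces a crisp threshold is given below. The two input features, the ramp breakpoints and the single AND rule are hypothetical choices made for illustration; they are not the rule base used in [19].

```python
def ramp(x, low, high):
    """Membership degree rising linearly from 0 at `low` to 1 at `high`."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (x - low) / (high - low)))


def anomaly_degree(failed_logins_per_min, conn_rate_per_s):
    """Degree in [0, 1] to which the observation is anomalous.

    Rule (assumed): IF failures are HIGH AND connection rate is HIGH
    THEN behavior is anomalous. Fuzzy AND is taken as the minimum.
    """
    high_failures = ramp(failed_logins_per_min, 3, 10)
    high_rate = ramp(conn_rate_per_s, 50, 200)
    return min(high_failures, high_rate)


# Illustrative observations: clearly normal, borderline, clearly abnormal.
for obs in [(0, 0), (6.5, 125), (20, 500)]:
    print(obs, anomaly_degree(*obs))
```

An alarm threshold can then be placed on the degree (for example 0.7) and tuned per deployment, instead of hard-coding crisp limits on each feature.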
Bayesian Inference: security monitors usually produce a
large number of false alarms. A Bayesian network
approach can be used on top of the architecture to
correlate alerts from different sources and to filter false
notifications. This approach has been successfully used to
detect credential-stealing attacks [21]. Raw alerts
generated during the progression of an attack, such as
user profile violations and IDS notifications, are
correlated through a Bayesian network to identify
compromised (hijacked) users. The approach was able to
eliminate about 80% of false positives (a user who is not
compromised being declared compromised) without
missing any compromised user.
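The correlation step can be illustrated with the simplest Bayesian network structure, in which the alerts are conditionally independent given the hidden state "user compromised" (a naive Bayes model). This structure, the alert names and all probabilities below are assumptions made for the example, not the network or parameters of [21].

```python
def posterior_compromised(prior, likelihoods, observed):
    """P(compromised | alerts) under a naive Bayes assumption.

    likelihoods: {alert: (P(alert | compromised), P(alert | benign))}
    observed: {alert: True/False}; missing alerts count as not raised.
    """
    p_comp, p_benign = prior, 1.0 - prior
    for alert, (p_c, p_b) in likelihoods.items():
        if observed.get(alert, False):
            p_comp *= p_c
            p_benign *= p_b
        else:
            p_comp *= 1.0 - p_c
            p_benign *= 1.0 - p_b
    return p_comp / (p_comp + p_benign)


# Hypothetical parameters (illustrative values only).
likelihoods = {
    "profile_violation": (0.8, 0.10),
    "ids_notification": (0.7, 0.05),
}
one_alert = posterior_compromised(0.01, likelihoods,
                                  {"profile_violation": True})
both_alerts = posterior_compromised(0.01, likelihoods,
                                    {"profile_violation": True,
                                     "ids_notification": True})
print(round(one_alert, 4), round(both_alerts, 4))
```

A single alert leaves the posterior low, which is how isolated false alarms are filtered; agreement between independent sources raises it sharply.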

IV. CONCLUSION & 
PERSPECTIVES 

Traditional security solutions, with their traditional tools
and techniques, are no longer capable of handling
real-time big data network streams. We have shown
how Security Analytics (the application of Big Data
Analytics techniques to derive actionable intelligence and
insights from streams in real time) is rapidly becoming a
strong need for Big Data security setups. Although the
current adoption of analytical solutions is by no means
revolutionary, awareness of it is increasing
rapidly. To support this cause, in this paper we first
described the needs and challenges in security
analytics. We then highlighted the goal of our proposed
framework, which is to address data-driven security
analytics issues. Finally, we described the main
components of the framework, which is designed to apply
security analytics techniques to BDIs in order to decrease
the false positive rate of a predictive model for many
attacks. In future work, we plan to implement our
framework in a real infrastructure in order to evaluate and
test the efficiency of our proposal.
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Switching of code between the users for Enhancing the Security 

in OCDMA 
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Abstract: A new technique is proposed for optical code 
division multiple access (OCDMA) to enhance security against
eavesdroppers. The pulse spectrum of the code is switched
between two users, and the switching position of the
pulse spectrum varies from group to group. With this
technique, not every pulse of a code carries the information
directly, so the code detection probability of an individual
user by an eavesdropper decreases. The analysis and simulation
results are compared with the existing MQC, RD and MDW codes.

Key Words: Random Diagonal (RD) Code, Modified Double
Weight (MDW) Code, Zero Cross Correlation (ZCC) Code,
Spectral Amplitude Coding (SAC)



If a highly sensitive detector is used by the eavesdropper,
then the information can be detected from any spectral pulse
of a code of weight W. A new technique is therefore proposed in
this paper to remove the MAI and the problem of long code
length by switching a part of the code between users, so that
the information is directly available in only a few chips.
This is discussed in Section II. The mathematical analysis
of the proposed method is explained in Section III. Section
IV compares different methods with the proposed
method. The paper ends with the conclusion.



I. Introduction 

In optical networks, optical code division multiple access
(OCDMA) is attracting more and more attention, as multiple
users share the communication network asynchronously or
synchronously with a high level of security [1-2]. In an
optical code division multiple access system, user information
is transmitted by assigning the code address, in an OOK (on-off
keying) based pattern, to the '1' information bit. Information
about a user is extracted at the receiver end by correlating the
assigned codes of the users [3-4]. Multiple access interference
(MAI) exists in the network due to the in-phase cross-correlation
property, and it degrades the system performance. MAI is
reduced by reducing the in-phase cross-correlation
and by the detection technique. Zero in-phase cross-correlation
between the codes of different users eliminates the effect of MAI.
Security of the network against eavesdroppers is another
issue. It can be enhanced in OCDMA by increasing the
code length, as in the Modified Frequency Hopping code,
Optical Orthogonal Codes, the Modified Double Weight (MDW)
code and MQC [8-10]. These codes suffer from excessive code
length and large weight, which requires a
broadband source and narrow-band filters.
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II. Proposed Method 

A coding pattern is proposed in Table 1.1 for weight W and N
users with the in-phase zero cross-correlation property. A
2x2 switch with a defined sequence is used between the spectral pulse
positions of the codes of two users. The pulse spectrum (chip) position
used for switching between the users varies from one group of two users
to the next. $S_{11}$ and $S_{12}$ are the inputs of switch $S_1$, and
$S_{21}$ and $S_{22}$ are the inputs of switch $S_2$.
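The in-phase zero cross-correlation property, and the relation between code length, weight and number of users (L = WN), can be checked on a toy code. The block construction below, in which user k simply owns W consecutive chips, is a hypothetical example chosen because it is trivially zero cross-correlation; it is not the code pattern of Table 1.1 and it does not model the 2x2 switching.

```python
def block_zcc_code(num_users, weight):
    """Toy zero cross-correlation code: user k owns chips k*W .. k*W+W-1.

    The code length is L = W * N, with W the weight and N the number of
    users. This construction is an illustrative assumption.
    """
    length = num_users * weight
    return [
        [1 if k * weight <= i < (k + 1) * weight else 0
         for i in range(length)]
        for k in range(num_users)
    ]


def in_phase_cross_correlation(code_a, code_b):
    """Number of chip positions where both codes contain a 1."""
    return sum(a * b for a, b in zip(code_a, code_b))


# Same sizes as the simulation: 6 users, weight 2, so L = 12 chips.
codes = block_zcc_code(num_users=6, weight=2)
for k, code in enumerate(codes, start=1):
    print(k, "".join(map(str, code)))
```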

If all code chips (the full weight) have to be detected, the
probability of code detection for each user is given by Eq. (1)
for a zero in-phase cross-correlation code and by Eq. (2) for a
unity cross-correlation code [8, 11]:

P ([/)= i x _i_ x .... x _A_ (1) 

where $W_z$ is the weight with zero cross-correlation in the
MDW code.

If a high-performance detector is used, each chip of the code
carries the information. In that case the probability of code
detection for a zero cross-correlation code is given by Eq. (4),
and that of the proposed design by Eq. (5), with

$L = WN$, where $L$ is the code length and $N$ is the number of users.
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P(Z) = - (4) 



w-w s 

P(S) = —-^ (5) 



Table 1.1. Code pattern with switching of spectral pulse.
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III. 



Mathematical Analysis of BER 



For the analysis of this system we use the Gaussian approximation
in our calculation [4,10]. Because the system is based on zero
cross-correlation, we consider only the thermal noise
(Rth) and the shot noise (Rsn), and not PIIN.

Let $C_k(i)$ denote the $i$th element of the $k$th user's ZCC code
sequence; then the following assumptions are made.

a. Each light source spectrum is flat over the
bandwidth $[v_0 - \Delta v/2,\ v_0 + \Delta v/2]$, where $v_0$
is the central frequency and $\Delta v$ is the optical source
bandwidth in hertz.

b. Each power spectral component has an identical
spectral width.

c. Each user has nearly equal power at the transmitter.

d. Each user's bit stream is synchronized.


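The power spectral density (PSD) of the received signals can
be given as

$$ r(v) = \frac{P_{sr}}{\Delta v} \sum_{k=1}^{K} d_k \sum_{i=1}^{L} C_k(i)\,\mathrm{rect}(i) \qquad (6) $$

where

$$ \mathrm{rect}(i) = u\!\left[ v - v_0 - \frac{\Delta v}{2L}(-L + 2i - 2) \right] - u\!\left[ v - v_0 - \frac{\Delta v}{2L}(-L + 2i) \right] $$

and $u(v)$ is the unit step function, expressed as:

$$ u(v) = \begin{cases} 1, & v \ge 0 \\ 0, & v < 0 \end{cases} $$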

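$$ \int_0^\infty G(v)\,dv = \int_0^\infty \left[ \frac{P_{sr}}{\Delta v} \sum_{k=1}^{K} d_k \sum_{i=1}^{L} C_k(i)\,C_l(i)\,\mathrm{rect}(i) \right] dv \qquad (7) $$

Because the code has zero in-phase cross-correlation, only the
term of weight $W$ contributes, each chip having a spectral
width of $\Delta v / L$:

$$ \int_0^\infty G(v)\,dv = \frac{P_{sr}}{\Delta v} \left[ \sum_{k=1}^{K} d_k\, W\, \frac{\Delta v}{L} \right] \qquad (8) $$

The value of $\sum_{k=1}^{K} d_k$ is equal to 1; then

$$ \int_0^\infty G_{dd}(v)\,dv = \frac{P_{sr} W}{L} \qquad (9) $$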



The photocurrent $I$ can be expressed as

$$ I = I_{dd} = \Re \int_0^\infty G_{dd}(v)\,dv \qquad (10) $$

where $\Re$ denotes the responsivity of the photodetector.

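With Eqs. (9) and (10), the photocurrent is

$$ I = \frac{\Re P_{sr} W}{L} $$

The variance of the photocurrent due to the detection of
ideally unpolarized thermal light can be expressed as

$$ \langle I^2 \rangle = 2eB\,I_{dd} + \frac{4 K_b T_n B}{R_L} $$

$$ \langle I^2 \rangle = 2eB\,\Re \left[ \int_0^\infty G_{dd}(v)\,dv \right] + \frac{4 K_b T_n B}{R_L} \qquad (11) $$

When all users are transmitting bit 1, the probability of each
user sending 1 is 1/2, and Eq. (11) becomes

$$ \langle I^2 \rangle = \frac{P_{sr}\, e B\, \Re}{L}\,[W] + \frac{4 K_b T_n B}{R_L} $$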

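The signal-to-noise ratio of the direct detection technique is
given by the following equation:

$$ \mathrm{SNR} = \frac{(I_{dd})^2}{\langle I^2 \rangle} $$

Substituting the above equations, the SNR becomes

$$ \mathrm{SNR} = \frac{\Re^2 P_{sr}^2 W^2 / L^2}{\dfrac{P_{sr}\, e B\, \Re\, W}{L} + \dfrac{4 K_b T_n B}{R_L}} $$

and the bit error rate is

$$ \mathrm{BER} = \frac{1}{2}\,\mathrm{erfc}\!\left( \sqrt{\frac{\mathrm{SNR}}{8}} \right) \qquad (12) $$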
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Figl : Encoder and Decoder of Proposed Design 

IV. Result and analysis 



Typical parameters used in the calculation are as follows:

Photodetector quantum efficiency (η)     0.6
Line-width of broadband source (Δv)      3.75 THz
Operating wavelength                     1552 nm
Electrical bandwidth (B)                 311 MHz
Data bit rate (R_b)                      622 Mbps
Receiver noise temperature (T_n)         300 K
Receiver load resistor (R_L)             1030 Ω
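The BER calculation of Section III can be sketched numerically with the parameters listed above. The sketch takes the direct-detection expressions in the form commonly used for SAC-OCDMA with zero cross-correlation codes, SNR = (ℜ·P_sr·W/L)² / (e·B·ℜ·P_sr·W/L + 4·K_b·T_n·B/R_L) and BER = ½·erfc(√(SNR/8)), and it assumes the usual responsivity relation ℜ = η·e·λ/(h·c), which is not stated in the text. The received power values are hypothetical inputs chosen for illustration.

```python
import math

# Physical constants (SI units).
E_CHARGE = 1.602176634e-19   # electron charge (C)
PLANCK = 6.62607015e-34      # Planck constant (J s)
BOLTZMANN = 1.380649e-23     # Boltzmann constant (J/K)
LIGHT_SPEED = 299792458.0    # speed of light (m/s)

# Parameters from the list above.
ETA = 0.6                    # photodetector quantum efficiency
WAVELENGTH = 1552e-9         # operating wavelength (m)
BANDWIDTH = 311e6            # electrical bandwidth B (Hz)
T_N = 300.0                  # receiver noise temperature (K)
R_L = 1030.0                 # receiver load resistor (ohm)


def responsivity(eta=ETA, wavelength=WAVELENGTH):
    """Assumed relation: responsivity = eta * e * lambda / (h * c), in A/W."""
    return eta * E_CHARGE * wavelength / (PLANCK * LIGHT_SPEED)


def snr(p_sr_watt, weight, length):
    """Signal-to-noise ratio with shot and thermal noise only."""
    r = responsivity()
    current = r * p_sr_watt * weight / length
    shot = E_CHARGE * BANDWIDTH * current
    thermal = 4 * BOLTZMANN * T_N * BANDWIDTH / R_L
    return current ** 2 / (shot + thermal)


def ber(p_sr_dbm, weight=2, num_users=6):
    """BER = 0.5 * erfc(sqrt(SNR / 8)), with code length L = W * N."""
    length = weight * num_users
    p_sr_watt = 1e-3 * 10 ** (p_sr_dbm / 10)
    return 0.5 * math.erfc(math.sqrt(snr(p_sr_watt, weight, length) / 8))


# Hypothetical received powers in dBm (illustrative inputs only).
for p_dbm in (-30, -25, -20):
    print(p_dbm, ber(p_dbm))
```

The BER falls steeply as the received power increases, because thermal noise dominates at these power levels.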
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The block diagram of proposed scheme shown in fig 1 the 
simulation is done for 6 users with a code weight of 2. The width
of each spectral chip is kept at 0.6 nm. The simulation is carried
out under practical conditions, with all nonlinear effects switched
on. It is performed at 622 Mbit/s over 40 km of fiber and at
1 Gbit/s over 40 km of fiber, using ITU standard single-mode
fiber (SMF). The attenuation (α = 0.25 dB/km) and dispersion
(18 ps/nm) are kept fixed. On the decoder side, after decoding,
the signal is converted to the electrical domain by passing it
through a photodetector and a 0.75 GHz low-pass
Bessel filter (LPF). The dark current was 5 nA, and the
thermal noise coefficient was $1.8 \times 10^{-23}$ W/Hz for each
of the photodetectors. The performance of the system was
characterized by the BER and the eye pattern.



Fig. 2. Eye Diagram of 40 km for 6 users at 622 Mbit/s
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Fig. 3. Eye Diagram of 40km for 6 users at lGbits/s 



The BER of the proposed scheme shows that it gives
better performance as the number of users increases, with
comparatively lower bandwidth, owing to the lower code weight
and the smaller spectral width of each chip. As shown in the
figure, the BER achieved is better than the minimum acceptable
BER ($10^{-9}$). Figs. 5 and 6 show the probability of code
detection for the proposed technique, compared with the ZCC
code [11] and the MDW code, where S represents the number of
switching codes. The relation between the power received at the
receiver end and the BER is shown in Fig. 7.
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Fig. 5. Probability of code detetction Vs. Number of User for each pulse 
detection 
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Fig. 4. BER performance vs. Number of users 
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[10] Zou Wei, H.M.H. Shalaby, H. Ghafouri-Shiraz, " Modified quadratic 
congruence code for fiber Bragg-grating-based spectral-amplitude- 
coding optical CDMA system," J. Lightw. Technol. 19 , pp. 1274- 
1281,2001. 

[11] M.S. Anuar, S.A. Aljunid, N.M. Saad, S.M. Hamzah, "New design of
spectral amplitude coding in OCDMA with zero cross-correlation,"
Optics Communications 282 (14), pp. 2659-2664, 2009.
[12] T.H. Shake, "Security Performance of Optical CDMA
Against Eavesdropping," J. Lightw. Technol. 23, pp. 655-672, 2005.
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vol. 30, no. 22, 2012 



Fig. 7. Graph between the receive power (Psr) and BER 



V. CONCLUSION 

Security against eavesdroppers is enhanced using inter-code
switching between users, with a pattern that differs from
group to group. This technique reduces the code length
while giving better performance. The probability of
information detection from a single pulse of the code or from
the whole code compares favorably with that of the ZCC code.
The performance is also compared with existing codes. The
performance of the proposed technique is analyzed with the
optiwave7 simulation software and by mathematical analysis.
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